Chapter 2. Files

Introduction

Credit: Mark Lutz, author of Programming Python and Python Quick Reference, co-author of Learning Python

Behold the file—one of the first things that any reasonably pragmatic programmer reaches for in a programming language’s toolbox. Because processing external files is a very real, tangible task, the quality of file-processing interfaces is a good way to assess the practicality of a programming tool.

As the recipes in this chapter attest, Python shines in this task. Files in Python are supported in a variety of layers: from the built-in open function (a synonym for the standard file object type), to specialized tools in standard library modules such as os, to third-party utilities available on the Web. All told, Python’s arsenal of file tools provides several powerful ways to access files in your scripts.

File Basics

In Python, a file object is an instance of built-in type file. The built-in function open creates and returns a file object. The first argument, a string, specifies the file’s path (i.e., the filename preceded by an optional directory path). The second argument to open, also a string, specifies the mode in which to open the file. For example:

input = open('data', 'r')
output = open('/tmp/spam', 'w')

open accepts a file path in which directories and files are separated by slash characters (/), regardless of the proclivities of the underlying operating system. On systems that don’t use slashes, you can use a backslash character (\) instead, but there’s no real reason to do so. Backslashes are harder to fit nicely in string literals, since you have to double them up or use “raw” strings. If the file path argument does not include the file’s directory name, the file is assumed to reside in the current working directory (which is a disjoint concept from the Python module search path).

For the mode argument, use 'r' to read the file in text mode; this is the default value and is commonly omitted, so that open is called with just one argument. Other common modes are 'rb' to read the file in binary mode, 'w' to create and write to the file in text mode, and 'wb' to create and write to the file in binary mode. A variant of 'r' that is sometimes precious is 'rU', which tells Python to read the file in text mode with “universal newlines”: mode 'rU' can read text files independently of the line-termination convention the files are using, be it the Unix way, the Windows way, or even the (old) Mac way. (Mac OS X today is a Unix for all intents and purposes, but releases of Mac OS 9 and earlier, just a few years ago, were quite different.)

The distinction between text mode and binary mode is important on non-Unix-like platforms because of the line-termination characters used on these systems. When you open a file in binary mode, Python knows that it doesn’t need to worry about line-termination characters; it just moves bytes between the file and in-memory strings without any kind of translation. When you open a file in text mode on a non-Unix-like system, however, Python knows it must translate between the '\n' line-termination characters used in strings and whatever the current platform uses in the file itself. All of your Python code can always rely on '\n' as the line-termination character, as long as you properly indicate text or binary mode when you open the file.

Once you have a file object, you perform all file I/O by calling methods of this object, as we’ll discuss in a moment. When you’re done with the file, you should finish by calling the close method on the object, to close the connection to the file:

input.close( )

In short scripts, people often omit this step, as Python automatically closes the file when a file object is reclaimed during garbage collection (which in mainstream Python means the file is closed just about at once, although other important Python implementations, such as Jython and IronPython, have other, more relaxed garbage-collection strategies). Nevertheless, it is good programming practice to close your files as soon as possible, and it is especially a good idea in larger programs, which otherwise may be at more risk of having excessive numbers of uselessly open files lying about. Note that try/finally is particularly well suited to ensuring that a file gets closed, even when a function terminates due to an uncaught exception.
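
For instance, here is a minimal sketch of that pattern ('data' is the file opened earlier, and process is just a placeholder for whatever per-line work you need):

file_object = open('data')
try:
    for line in file_object:
        process(line)          # placeholder for your per-line processing
finally:
    file_object.close( )       # runs even if process() raises an exception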

To write to a file, use the write method:

output.write(s)

where s is a string. Think of s as a string of characters if output is open for text-mode writing, and as a string of bytes if output is open for binary-mode writing. Files have other writing-related methods, such as flush, to send any data being buffered, and writelines, to write a sequence of strings in a single call. However, write is by far the most commonly used method.
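
As a quick sketch of the three writing methods together (the strings written are, of course, just examples):

output = open('/tmp/spam', 'w')
output.write('a single string\n')                   # one string at a time
output.writelines(['line one\n', 'line two\n'])     # a sequence of strings in one call
output.flush( )                                     # push any buffered data to the OS now
output.close( )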

Reading from a file is more common than writing to a file, and more issues are involved, so file objects have more reading methods than writing ones. The readline method reads and returns the next line from a text file. Consider the following loop:

while True:
    line = input.readline( )
    if not line: break
    process(line)

This was once idiomatic Python but it is no longer the best way to read and process all of the lines from a file. Another dated alternative is to use the readlines method, which reads the whole file and returns a list of lines:

for line in input.readlines( ):
    process(line)

readlines is useful only for files that fit comfortably in physical memory. If the file is truly huge, readlines can fail or at least slow things down quite drastically (virtual memory fills up and the operating system has to start copying parts of physical memory to disk). In today’s Python, just loop on the file object itself to get a line at a time with excellent memory and performance characteristics:

for line in input:
    process(line)

Of course, you don’t always want to read a file line by line. You may instead want to read some or all of the bytes in the file, particularly if you’ve opened the file for binary-mode reading, where lines are unlikely to be an applicable concept. In this case, you can use the read method. When called without arguments, read reads and returns all the remaining bytes from the file. When read is called with an integer argument N, it reads and returns the next N bytes (or all the remaining bytes, if less than N bytes remain). Other methods worth mentioning are seek and tell, which support random access to files. These methods are normally used with binary files made up of fixed-length records.

Portability and Flexibility

On the surface, Python’s file support is straightforward. However, before you peruse the code in this chapter, I want to underscore two aspects of Python’s file support: code portability and interface flexibility.

Keep in mind that most file interfaces in Python are fully portable across platform boundaries. It would be difficult to overstate the importance of this feature. A Python script that searches all files in a “directory” tree for a bit of text, for example, can be freely moved from platform to platform without source-code changes: just copy the script’s source file to the new target machine. I do it all the time—so much so that I can happily stay out of operating system wars. With Python’s portability, the underlying platform is almost irrelevant.

Also, it has always struck me that Python’s file-processing interfaces are not restricted to real, physical files. In fact, most file tools work with any kind of object that exposes the same interface as a real file object. Thus, a file reader cares only about read methods, and a file writer cares only about write methods. As long as the target object implements the expected protocol, all goes well.

For example, suppose you have written a general file-processing function such as the following, meant to apply a passed-in function to each line of an input file:

def scanner(fileobject, linehandler):
    for line in fileobject:
        linehandler(line)

If you code this function in a module file and drop that file into a “directory” that’s on your Python search path (sys.path), you can use it any time you need to scan a text file line by line, now or in the future. To illustrate, here is a client script that simply prints the first word of each line:

from myutils import scanner
def firstword(line):
    print line.split( )[0]
file = open('data')
scanner(file, firstword)

So far, so good; we’ve just coded a small, reusable software component. But notice that there are no type declarations in the scanner function, only an interface constraint—any object that is iterable line by line will do. For instance, suppose you later want to provide canned test input from a string object, instead of using a real, physical file. The standard StringIO module, and the equivalent but faster cStringIO, provide the appropriate wrapping and interface forgery:

from cStringIO import StringIO
from myutils import scanner
def firstword(line): print line.split( )[0]
string = StringIO('one\ntwo xxx\nthree\n')
scanner(string, firstword)

StringIO objects are plug-and-play compatible with file objects, so scanner takes its three lines of text from an in-memory string object, rather than a true external file. You don’t need to change the scanner to make this work—just pass it the right kind of object. For more generality, you can even use a class to implement the expected interface instead:

class MyStream(object):
    def __iter__(self):
        # grab and return text from wherever
        return iter(['a\n', 'b c d\n'])
from myutils import scanner
def firstword(line):
    print line.split( )[0]
object = MyStream( )
scanner(object, firstword)

This time, as scanner attempts to read the file, it really calls out to the __iter__ method you’ve coded in your class. In practice, such a method might use other Python standard tools to grab text from a variety of sources: an interactive user, a popup GUI input box, a shelve object, an SQL database, an XML or HTML page, a network socket, and so on. The point is that scanner doesn’t know or care what type of object is implementing the interface it expects, or what that interface actually does.

Object-oriented programmers know this deliberate naiveté as polymorphism. The type of the object being processed determines what an operation, such as the for-loop iteration in scanner, actually does. Everywhere in Python, object interfaces, rather than specific data types, are the unit of coupling. The practical effect is that functions are often applicable to a much broader range of problems than you might expect. This is especially true if you have a background in statically typed languages such as C or C++. It is almost as if we get C++ templates for free in Python. Code has an innate flexibility that is a by-product of Python’s strong but dynamic typing.

Of course, code portability and flexibility run rampant in Python development and are not really confined to file interfaces. Both are features of the language that are simply inherited by file-processing scripts. Other Python benefits, such as its easy scriptability and code readability, are also key assets when it comes time to change file-processing programs. But rather than extolling all of Python’s virtues here, I’ll simply defer to the wonderful recipes in this chapter and this book at large for more details. Enjoy!

2.1. Reading from a File

Credit: Luther Blissett

Problem

You want to read text or data from a file.

Solution

Here’s the most convenient way to read all of the file’s contents at once into one long string:

all_the_text = open('thefile.txt').read( )    # all text from a text file
all_the_data = open('abinfile', 'rb').read( ) # all data from a binary file

However, it is safer to bind the file object to a name, so that you can call close on it as soon as you’re done, to avoid ending up with open files hanging around. For example, for a text file:

file_object = open('thefile.txt')
try:
    all_the_text = file_object.read( )
finally:
    file_object.close( )

You don’t necessarily have to use the try/finally statement here, but it’s a good idea to use it, because it ensures the file gets closed even when an error occurs during reading.

The simplest, fastest, and most Pythonic way to read a text file’s contents at once as a list of strings, one per line, is:

list_of_all_the_lines = file_object.readlines( )

This leaves a '\n' at the end of each line; if you don’t want that, you have alternatives, such as:

list_of_all_the_lines = file_object.read( ).splitlines( )
list_of_all_the_lines = file_object.read( ).split('\n')
list_of_all_the_lines = [L.rstrip('\n') for L in file_object]

The simplest and fastest way to process a text file one line at a time is simply to loop on the file object with a for statement:

for line in file_object:
    process line

This approach also leaves a '\n' at the end of each line; you may remove it by starting the for loop’s body with:

    line = line.rstrip('\n')

or even, when you’re OK with getting rid of trailing whitespace from each line (not just a trailing '\n'), the generally handier:

    line = line.rstrip( )

Discussion

Unless the file you’re reading is truly huge, slurping it all into memory in one gulp is often fastest and most convenient for any further processing. The built-in function open creates a Python file object (alternatively, you can equivalently call the built-in type file). You call the read method on that object to get all of the contents (whether text or binary) as a single long string. If the contents are text, you may choose to immediately split that string into a list of lines with the split method or the specialized splitlines method. Since splitting into lines is frequently needed, you may also call readlines directly on the file object for faster, more convenient operation.

You can also loop directly on the file object, or pass it to callables that require an iterable, such as list or max—when thus treated as an iterable, a file object open for reading has the file’s text lines as the iteration items (therefore, this should be done for text files only). This kind of line-by-line iteration is cheap in terms of memory consumption and fairly speedy too.
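
For example, assuming a text file named thefile.txt, a couple of small sketches of treating the file object as an iterable:

all_lines = list(open('thefile.txt'))            # each text line becomes one list item
longest = max(map(len, open('thefile.txt')))     # length of the longest line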

On Unix and Unix-like systems, such as Linux, Mac OS X, and other BSD variants, there is no real distinction between text files and binary data files. On Windows and very old Macintosh systems, however, line terminators in text files are encoded, not with the standard '\n' separator, but with '\r\n' and '\r', respectively. Python translates these line-termination characters into '\n' on your behalf. This means that you need to tell Python when you open a binary file, so that it won’t perform such translation. To do so, use 'rb' as the second argument to open. This is innocuous even on Unix-like platforms, and it’s a good habit to distinguish binary files from text files even there, although it’s not mandatory in that case. Such good habits will make your programs more immediately understandable, as well as more compatible with different platforms.

If you’re unsure about which line-termination convention a certain text file might be using, use 'rU' as the second argument to open, requesting universal endline translation. This lets you freely interchange text files among Windows, Unix (including Mac OS X), and old Macintosh systems, without worries: all kinds of line-ending conventions get mapped to '\n', whatever platform your code is running on.

You can call methods such as read directly on the file object produced by the open function, as shown in the first snippet of the solution. When you do so, you no longer have a reference to the file object as soon as the reading operation finishes. In practice, Python notices the lack of a reference at once, and immediately closes the file. However, it is better to bind a name to the result of open, so that you can call close yourself explicitly when you are done with the file. This ensures that the file stays open for as short a time as possible, even on platforms such as Jython, IronPython, and other hypothetical future versions of Python, on which more advanced garbage-collection mechanisms might delay the automatic closing that the current version of C-based Python performs at once. To ensure that a file object is closed even if errors happen during its processing, the most solid and prudent approach is to use the try/finally statement:

file_object = open('thefile.txt')
try:
    for line in file_object:
        process line
finally:
    file_object.close( )

Be careful not to place the call to open inside the try clause of this try/finally statement (a rather common error among beginners). If an error occurs during the opening, there is nothing to close, and besides, nothing gets bound to name file_object, so you definitely don’t want to call file_object.close()!

If you choose to read the file a little at a time, rather than all at once, the idioms are different. Here’s one way to read a binary file 100 bytes at a time, until you reach the end of the file:

file_object = open('abinfile', 'rb')
try:
    while True:
        chunk = file_object.read(100)
        if not chunk:
            break
        do_something_with(chunk)
finally:
    file_object.close( )

Passing an argument N to the read method ensures that read will read only the next N bytes (or fewer, if the file is closer to the end). read returns the empty string when it reaches the end of the file. Complicated loops are best encapsulated as reusable generators. In this case, we can encapsulate the logic only partially, because a generator’s yield keyword is not allowed in the try clause of a try/finally statement. Giving up on the assurance of file closing afforded by try/finally, we can therefore settle for:

def read_file_by_chunks(filename, chunksize=100):
    file_object = open(filename, 'rb')
    while True:
        chunk = file_object.read(chunksize)
        if not chunk:
            break
        yield chunk
    file_object.close( )
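
Should your Python be recent enough to allow yield inside the try clause of a try/finally statement (Python 2.5 and later do), a sketch of a variant that regains the closing guarantee might look like this (the function name is just for illustration):

def read_file_by_chunks_2(filename, chunksize=100):
    file_object = open(filename, 'rb')
    try:
        while True:
            chunk = file_object.read(chunksize)
            if not chunk:
                break
            yield chunk
    finally:
        file_object.close( )    # runs when the generator is exhausted or closed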

Once this read_file_by_chunks generator is available, your application code to read and process a binary file by fixed-size chunks becomes extremely simple:

for chunk in read_file_by_chunks('abinfile'):
    do_something_with(chunk)

Reading a text file one line at a time is a frequent task. Just loop on the file object, as in:

for line in open('thefile.txt', 'rU'):
    do_something_with(line)

Here, too, in order to be 100% certain that no uselessly open file object will ever be left just hanging around, you may want to code this snippet in a more rigorously correct and prudent way:

file_object = open('thefile.txt', 'rU')
try:
    for line in file_object:
        do_something_with(line)
finally:
    file_object.close( )

See Also

Recipe 2.2; documentation for the open built-in function and file objects in the Library Reference and Python in a Nutshell.

2.2. Writing to a File

Credit: Luther Blissett

Problem

You want to write text or data to a file.

Solution

Here is the most convenient way to write one long string to a file:

open('thefile.txt', 'w').write(all_the_text)  # text to a text file
open('abinfile', 'wb').write(all_the_data)    # data to a binary file

However, it is safer to bind the file object to a name, so that you can call close on the file object as soon as you’re done. For example, for a text file:

file_object = open('thefile.txt', 'w')
file_object.write(all_the_text)
file_object.close( )

Often, the data you want to write is not in one big string, but in a list (or other sequence) of strings. In this case, you should use the writelines method (which, despite its name, is not limited to lines and works just as well with binary data as with text files!):

file_object.writelines(list_of_text_strings)
open('abinfile', 'wb').writelines(list_of_data_strings)

Calling writelines is much faster than the alternatives of joining the strings into one big string (e.g., with ''.join) and then calling write, or calling write repeatedly in a loop.

Discussion

To create a file object for writing, you must always pass a second argument to open (or file)—either 'w' to write textual data or 'wb' to write binary data. The same considerations detailed previously in Recipe 2.1 apply here, except that calling close explicitly is even more advisable when you’re writing to a file rather than reading from it. Only by closing the file can you be reasonably sure that the data is actually on the disk and not still residing in some temporary buffer in memory.

Writing a file a little at a time is even more common than reading a file a little at a time. You can just call write and/or writelines repeatedly, as each string or sequence of strings to write becomes ready. Each write operation appends data at the end of the file, after all the previously written data. When you’re done, call the close method on the file object. If all the data is available at once, a single writelines call is faster and simpler. However, if the data becomes available a little at a time, it’s better to call write as the data comes, than to build up a temporary list of pieces (e.g., with append) just in order to be able to write it all at once in the end with writelines. Reading and writing are quite different, with respect to the performance and convenience implications of operating “in bulk” versus operating a little at a time.
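
A minimal sketch of the incremental approach (produce_next_piece is a hypothetical callable that returns an empty string when no more data is available):

file_object = open('thefile.txt', 'w')
while True:
    piece = produce_next_piece( )    # hypothetical source of data, '' when done
    if not piece:
        break
    file_object.write(piece)
file_object.close( )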

When you open a file for writing with option 'w' (or 'wb'), any data that might already have been in the file is immediately destroyed; even if you close the file object immediately after opening it, you still end up with an empty file on the disk. If you want the data you’re writing to be appended to the previous contents of the file, open the file with option 'a' (or 'ab') instead. More advanced options allow both reading and writing on the same open file object—in particular, see Recipe 2.8 for option 'r+b', which, in practice, is the only frequently used one out of all the advanced option strings.
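
For example, a small sketch of appending a line to a log-like text file:

log_file = open('thefile.txt', 'a')      # 'a': keep existing contents, append new data
log_file.write('one more line of text\n')
log_file.close( )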

See Also

Recipe 2.1; Recipe 2.8; documentation for the open built-in function and file objects in the Library Reference and Python in a Nutshell.

2.3. Searching and Replacing Text in a File

Credit: Jeff Bauer, Adam Krieg

Problem

You need to change one string into another throughout a file.

Solution

String substitution is most simply performed by the replace method of string objects. The work here is to support reading from a specified file (or standard input) and writing to a specified file (or standard output):

#!/usr/bin/env python
import os, sys
nargs = len(sys.argv)
if not 3 <= nargs <= 5:
    print "usage: %s search_text replace_text [infile [outfile]]" % 
        os.path.basename(sys.argv[0])
else:
    stext = sys.argv[1]
    rtext = sys.argv[2]
    input_file = sys.stdin
    output_file = sys.stdout
    if nargs > 3:
        input_file = open(sys.argv[3])
    if nargs > 4:
        output_file = open(sys.argv[4], 'w')
    for s in input_file:
        output_file.write(s.replace(stext, rtext))
    output_file.close( )
    input_file.close( )

Discussion

This recipe is really simple, but that’s what’s beautiful about it—why do complicated stuff when simple stuff suffices? As indicated by the leading “shebang” line, the recipe is a simple main script, meaning a script meant to be run directly at a shell command prompt, as opposed to a module meant to be imported from elsewhere. The script looks at its arguments to determine the search text, the replacement text, the input file (defaulting to standard input), and the output file (defaulting to standard output). Then, it loops over each line of the input file, writing to the output file a copy of the line with the substitution performed on it. That’s all! For accuracy, the script closes both files at the end.

As long as an input file fits comfortably in memory in two copies (one before and one after the replacement, since strings are immutable), we could, with an increase in speed, operate on the entire input file’s contents at once instead of looping. With today’s low-end PCs typically containing at least 256 MB of memory, handling files of up to about 100 MB should not be a problem, and few text files are bigger than that. It suffices to replace the for loop with one single statement:

output_file.write(input_file.read( ).replace(stext, rtext))

As you can see, that’s even simpler than the loop used in the recipe.

See Also

Documentation for the open built-in function, file objects, and strings’ replace method in the Library Reference and Python in a Nutshell.

2.4. Reading a Specific Line from a File

Credit: Luther Blissett

Problem

You want to read from a text file a single line, given the line number.

Solution

The standard Python library linecache module makes this task a snap:

import linecache
theline = linecache.getline(thefilepath, desired_line_number)

Discussion

The standard linecache module is usually the optimal Python solution for this task. linecache is particularly useful when you have to perform this task repeatedly for several lines in a file, since linecache caches information to avoid uselessly repeating work. When you know that you won’t be needing any more lines from the cache for a while, call the module’s clearcache function to free the memory used for the cache. You can also use checkcache if the file may have changed on disk and you must make sure you are getting the updated version.
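
For instance, a small sketch of the typical sequence of calls (the line numbers are just placeholders):

import linecache
theline = linecache.getline(thefilepath, 4)    # cached after the first access
another = linecache.getline(thefilepath, 7)    # served from the same cache
linecache.checkcache( )    # re-read any cached file that has changed on disk
linecache.clearcache( )    # release the memory used by the cache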

linecache reads and caches all of the text file whose name you pass to it, so, if it’s a very large file and you need only one of its lines, linecache may be doing more work than is strictly necessary. Should this happen to be a bottleneck for your program, you may get an increase in speed by coding an explicit loop, encapsulated within a function, such as:

def getline(thefilepath, desired_line_number):
    if desired_line_number < 1: return ''
    for current_line_number, line in enumerate(open(thefilepath, 'rU')):
        if current_line_number == desired_line_number-1: return line
    return ''

The only detail requiring attention is that enumerate counts from 0, so, since we assume the desired_line_number argument counts from 1, we need the -1 in the == comparison.

See Also

Documentation for the linecache module in the Library Reference and Python in a Nutshell; Perl Cookbook recipe 8.8.

2.5. Counting Lines in a File

Credit: Luther Blissett

Problem

You need to compute the number of lines in a file.

Solution

The simplest approach for reasonably sized files is to read the file as a list of lines, so that the count of lines is the length of the list. If the file’s path is in a string bound to a variable named thefilepath, all the code you need to implement this approach is:

count = len(open(thefilepath, 'rU').readlines( ))

For a truly huge file, however, this simple approach may be very slow or even fail to work. If you have to worry about humongous files, a loop on the file always works:

count = -1
for count, line in enumerate(open(thefilepath, 'rU')):
    pass
count += 1

A tricky alternative, potentially faster for truly humongous files, for when the line terminator is '\n' (or has '\n' as a substring, as happens on Windows):

count = 0
thefile = open(thefilepath, 'rb')
while True:
    buffer = thefile.read(8192*1024)
    if not buffer:
        break
    count += buffer.count('\n')
thefile.close( )

The 'rb' argument to open is necessary if you’re after speed—without that argument, this snippet might be very slow on Windows.

Discussion

When an external program counts a file’s lines, such as wc -l on Unix-like platforms, you can of course choose to use that (e.g., via os.popen). However, it’s generally simpler, faster, and more portable to do the line-counting in your own program. You can rely on almost all text files having a reasonable size, so that reading the whole file into memory at once is feasible. For all such normal files, the len of the result of readlines gives you the count of lines in the simplest way.

If the file is larger than available memory (say, a few hundred megabytes on a typical PC today), the simplest solution can become unacceptably slow, as the operating system struggles to fit the file’s contents into virtual memory. It may even fail, when swap space is exhausted and virtual memory can’t help any more. On a typical PC, with 256MB RAM and virtually unlimited disk space, you should still expect serious problems when you try to read into memory files above, say, 1 or 2 GB, depending on your operating system. (Some operating systems are much more fragile than others in handling virtual-memory issues under such overly stressed load conditions.) In this case, looping on the file object, as shown in this recipe’s Solution, is better. The enumerate built-in keeps the line count without your code having to do it explicitly.

Counting line-termination characters while reading the file by bytes in reasonably sized chunks is the key idea in the third approach. It’s probably the least immediately intuitive, and it’s not perfectly cross-platform, but you might hope that it’s fastest (e.g., when compared with recipe 8.2 in the Perl Cookbook).

However, in most cases, performance doesn’t really matter all that much. When it does matter, the time-sink part of your program might not be what your intuition tells you it is, so you should never trust your intuition in this matter—instead, always benchmark and measure. For example, consider a typical Unix syslog file of middling size, a bit over 18 MB of text in 230,000 lines:

[situ@tioni nuc]$ wc nuc
 231581 2312730 18508908 nuc

And consider the following testing-and-benchmark framework script, bench.py:

import time
def timeo(fun, n=10):
    start = time.clock( )
    for i in xrange(n): fun( )
    stend = time.clock( )
    thetime = stend-start
    return fun.__name__, thetime
import os
def linecount_w( ):
    return int(os.popen('wc -l nuc').read( ).split( )[0])
def linecount_1( ):
    return len(open('nuc').readlines( ))
def linecount_2( ):
    count = -1
    for count, line in enumerate(open('nuc')): pass
    return count+1
def linecount_3( ):
    count = 0
    thefile = open('nuc', 'rb')
    while True:
        buffer = thefile.read(65536)
        if not buffer: break
        count += buffer.count('\n')
    return count
for f in linecount_w, linecount_1, linecount_2, linecount_3:
    print f.__name__, f( )
for f in linecount_1, linecount_2, linecount_3:
    print "%s: %.2f"%timeo(f)

First, I print the line-counts obtained by all methods, thus ensuring that no anomaly or error has occurred (counting tasks are notoriously prone to off-by-one errors). Then, I run each alternative 10 times, under the control of the timing function timeo, and look at the results. Here they are, on the old but reliable machine I measured them on:

[situ@tioni nuc]$ python -O bench.py
linecount_w 231581
linecount_1 231581
linecount_2 231581
linecount_3 231581
linecount_1: 4.84
linecount_2: 4.54
linecount_3: 5.02

As you can see, the performance differences hardly matter: your users will never even notice a difference of 10% or so in one auxiliary task. However, the fastest approach (for my particular circumstances, on an old but reliable PC running a popular Linux distribution, and for this specific benchmark) is the humble loop-on-every-line technique, while the slowest one is the fancy, ambitious technique that counts line terminators by chunks. In practice, unless I had to worry about files of many hundreds of megabytes, I’d always use the simplest approach (i.e., the first one presented in this recipe).

Measuring the exact performance of code snippets (rather than blindly using complicated approaches in the hope that they’ll be faster) is very important—so important, indeed, that the Python Standard Library includes a module, timeit, specifically designed for such measurement tasks. I suggest you use timeit, rather than coding your own little benchmarks as I have done here. The benchmark I just showed you is one I’ve had around for years, since well before timeit appeared in the standard Python library, so I think I can be forgiven for not using timeit in this specific case!
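
As a rough sketch, the readlines-based count of the same 'nuc' file might be timed with timeit like this:

import timeit
timer = timeit.Timer("len(open('nuc').readlines( ))")
print "%.2f" % timer.timeit(number=10)    # total seconds for 10 runs of the statement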

See Also

The Library Reference and Python in a Nutshell sections on file objects, the enumerate built-in, os.popen, and the time and timeit modules; Perl Cookbook recipe 8.2.

2.6. Processing Every Word in a File

Credit: Luther Blissett

Problem

You need to do something with each and every word in a file.

Solution

This task is best handled by two nested loops, one on lines and another on the words in each line:

for line in open(thefilepath):
    for word in line.split( ):
        dosomethingwith(word)

The nested for statement’s header implicitly defines words as sequences of nonspaces separated by sequences of spaces (just as the Unix program wc does). For other definitions of words, you can use regular expressions. For example:

import re
re_word = re.compile(r"[w'-]+")
for line in open(thefilepath):
    for word in re_word.finditer(line):
        dosomethingwith(word.group(0))

In this case, a word is defined as a maximal sequence of alphanumerics, hyphens, and apostrophes.

Discussion

If you want to use other definitions of words, you will obviously need different regular expressions. The outer loop, on all lines in the file, won’t change.

It’s often a good idea to wrap iterations as iterator objects, and this kind of wrapping is most commonly and conveniently obtained by coding simple generators:

def words_of_file(thefilepath, line_to_words=str.split):
    the_file = open(thefilepath)
    for line in the_file:
        for word in line_to_words(line):
            yield word
    the_file.close( )
for word in words_of_file(thefilepath):
    dosomethingwith(word)

This approach lets you separate, cleanly and effectively, two different concerns: how to iterate over all items (in this case, words in a file) and what to do with each item in the iteration. Once you have cleanly encapsulated iteration concerns in an iterator object (often, as here, a generator), most of your uses of iteration become simple for statements. You can often reuse the iterator in many spots in your program, and if maintenance is ever needed, you can perform that maintenance in just one place—the definition of the iterator—rather than having to hunt for all uses. The advantages are thus very similar to those you obtain in any programming language by appropriately defining and using functions, rather than copying and pasting pieces of code all over the place. With Python’s iterators, you can get these reuse advantages for all of your looping-control structures, too.

We’ve taken the opportunity afforded by the refactoring of the loop into a generator to perform two minor enhancements—ensuring the file is explicitly closed, which is always a good idea, and generalizing the way each line is split into words (defaulting to the split method of string objects, but leaving a door open to more generality). For example, when we need words as defined by a regular expression, we can code another wrapper on top of words_of_file thanks to this “hook”:

import re
def words_by_re(thefilepath, repattern=r"[\w'-]+"):
    wre = re.compile(repattern)
    def line_to_words(line):
        for mo in wre.finditer(line):
            yield mo.group(0)
    return words_of_file(thefilepath, line_to_words)

Here, too, we supply a reasonable default for the regular expression pattern defining a word but still make it easy to pass a different value in those cases in which different definitions are necessary. Excessive generalization is a pernicious temptation, but a little tasteful generalization suggested by experience will most often amply repay the modest effort it requires. Having a function accept an optional argument, while providing the most likely value for the argument as the default value, is among the simplest and handiest ways to implement this modest and often worthwhile kind of generalization.

See Also

Chapter 19 for more on iterators and generators; Library Reference and Python in a Nutshell on file objects and the re module; Perl Cookbook recipe 8.3.

2.7. Using Random-Access Input/Output

Credit: Luther Blissett

Problem

You want to read a binary record from somewhere inside a large file of fixed-length records, without reading a record at a time to get there.

Solution

The byte offset of the start of a record in the file is the size of a record, in bytes, multiplied by the progressive number of the record (counting from 0). So, you can just seek right to the proper spot, then read the data. For example, to read the seventh record from a binary file where each record is 48 bytes long:

thefile = open('somebinfile', 'rb')
record_size = 48
record_number = 6
thefile.seek(record_size * record_number)
buffer = thefile.read(record_size)

Note that the record_number of the seventh record is 6: record numbers count from zero!

Discussion

This approach works only on files (generally binary ones) defined in terms of records that are all the same fixed size in bytes; it doesn’t work on normal text files. For clarity, the recipe shows the file being opened for reading as a binary file by passing 'rb' as the second argument to open, just before the seek. As long as the file object is open for reading as a binary file, you can perform as many seek and read operations as you need, before eventually closing the file again—you don’t necessarily open the file just before performing a seek on it.
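
For instance, a brief sketch reading a few records in arbitrary order from the same open file (the record numbers are placeholders):

thefile = open('somebinfile', 'rb')
record_size = 48
for record_number in (6, 0, 3):               # any order you like
    thefile.seek(record_size * record_number)
    buffer = thefile.read(record_size)
    # ... work with buffer here ...
thefile.close( )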

See Also

The section of the Library Reference and Python in a Nutshell on file objects; Perl Cookbook recipe 8.12.

2.8. Updating a Random-Access File

Credit: Luther Blissett

Problem

You want to read a binary record from somewhere inside a large file of fixed-length records, change some or all of the values of the record’s fields, and write the record back.

Solution

Read the record, unpack it, perform whatever computations you need for the update, pack the fields back into the record, seek to the start of the record again, write it back. Phew. Faster to code than to say:

import struct
format_string = '8l'                # e.g., say a record is 8 4-byte integers
thefile = open('somebinfile', 'r+b')
record_size = struct.calcsize(format_string)
record_number = 6
thefile.seek(record_size * record_number)
buffer = thefile.read(record_size)
fields = list(struct.unpack(format_string, buffer))
# Perform computations, suitably modifying fields, then:
buffer = struct.pack(format_string, *fields)
thefile.seek(record_size * record_number)
thefile.write(buffer)
thefile.close( )

Discussion

This approach works only on files (generally binary ones) defined in terms of records that are all the same, fixed size; it doesn’t work on normal text files. Furthermore, the size of each record must be that defined by a struct format string, as shown in the recipe’s code. A typical format string, for example, might be '8l', to specify that each record is made up of eight four-byte integers, each to be interpreted as a signed value and unpacked into a Python int. In this case, the fields variable in the recipe would be bound to a list of eight ints. Note that struct.unpack returns a tuple. Because tuples are immutable, the computation would have to rebind the entire fields variable. A list is mutable, so each field can be rebound as needed. Thus, for convenience, we explicitly ask for a list when we bind fields. Make sure, however, not to alter the length of the list. In this case, it needs to remain composed of exactly eight integers, or the struct.pack call will raise an exception when we call it with a format_string of '8l'. Also, this recipe is not suitable when working with records that are not all of the same, unchanging length.

To seek back to the start of the record, instead of using the record_size*record_number offset again, you may choose to do a relative seek:

thefile.seek(-record_size, 1)

The second argument to the seek method (1) tells the file object to seek relative to the current position (here, so many bytes back, because we used a negative number as the first argument). seek’s default is to seek to an absolute offset within the file (i.e., from the start of the file). You can also explicitly request this default behavior by calling seek with a second argument of 0.

You don’t need to open the file just before you do the first seek, nor do you need to close it right after the write. Once you have a file object that is correctly opened (i.e., for updating and as a binary rather than a text file), you can perform as many updates on the file as you want before closing the file again. These calls are shown here to emphasize the proper technique for opening a file for random-access updates and the importance of closing a file when you are done with it.

The file needs to be opened for updating (i.e., to allow both reading and writing). That’s what the 'r+b' argument to open means: open for reading and writing, but do not implicitly perform any transformations on the file’s contents because the file is a binary one. (The 'b' part is unnecessary but still recommended for clarity on Unix and Unix-like systems. However, it’s absolutely crucial on other platforms, such as Windows.) If you’re creating the binary file from scratch, but you still want to be able to go back, reread, and update some records without closing and reopening the file, you can use a second argument of 'w+b' instead. However, I have never witnessed this strange combination of requirements; binary files are normally first created (by opening them with 'wb', writing data, and closing the file) and later reopened for updating with 'r+b'.

While this approach is normally useful only on a file whose records are all the same size, another, more advanced possibility exists: a separate “index file” that provides the offset and length of each record inside the “data file”. Such indexed sequential access approaches aren’t much in fashion any more, but they used to be very important. Nowadays, one meets just about only text files (of many kinds, more and more often XML ones), databases, and occasional binary files with fixed-length records. Still, if you do need to access an indexed sequential binary file, the code is quite similar to that shown in this recipe, except that you must obtain the record_size and the offset argument to pass to thefile.seek by reading them from the index file, rather than computing them yourself as shown in this recipe’s Solution.
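
A minimal, purely hypothetical sketch of that indexed approach, assuming each index entry is a pair of integers (offset, length) packed with the struct format string '2l':

import struct
index_entry_size = struct.calcsize('2l')
def read_indexed_record(index_file, data_file, record_number):
    # each index entry holds (offset, length) for the corresponding data record
    index_file.seek(record_number * index_entry_size)
    offset, length = struct.unpack('2l', index_file.read(index_entry_size))
    data_file.seek(offset)
    return data_file.read(length)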

See Also

The sections of the Library Reference and Python in a Nutshell on file objects and the struct module; Perl Cookbook recipe 8.13.

2.9. Reading Data from zip Files

Credit: Paul Prescod, Alex Martelli

Problem

You want to directly examine some or all of the files contained in an archive in zip format, without expanding them on disk.

Solution

zip files are a popular, cross-platform way of archiving files. The Python Standard Library comes with a zipfile module to access such files easily:

import zipfile
z = zipfile.ZipFile("zipfile.zip", "r")
for filename in z.namelist( ):
    print 'File:', filename,
    bytes = z.read(filename)
    print 'has', len(bytes), 'bytes'

Discussion

Python can work directly with data in zip files. You can look at the list of items in the archive’s directory and work with the data files themselves. This recipe is a snippet that lists all of the names and content lengths of the files included in the zip archive zipfile.zip.

The zipfile module does not currently handle multidisk zip files nor zip files with appended comments. Take care to use 'r' as the flag argument, not 'rb', which might seem more natural (e.g., on Windows). With ZipFile, the flag is not used the same way as with the open built-in, and 'rb' is not recognized. The 'r' flag handles the inherently binary nature of all zip files on all platforms.

When a zip file contains some Python modules (meaning .py or preferably .pyc files), possibly in addition to other (data) files, you can add the file’s path to Python’s sys.path and then use the import statement to import modules from the zip file. Here’s a toy, self-contained, purely demonstrative example that creates such a zip file on the fly, imports a module from it, then removes it—all just to show you how it’s done:

import zipfile, tempfile, os, sys
handle, filename = tempfile.mkstemp('.zip')
os.close(handle)
z = zipfile.ZipFile(filename, 'w')
z.writestr('hello.py', 'def f( ): return "hello world from "+__file__\n')
z.close( )
sys.path.insert(0, filename)
import hello
print hello.f( )
os.unlink(filename)

Running this script emits something like:

hello world from /tmp/tmpESVzeY.zip/hello.py

Besides illustrating Python’s ability to import from a zip file, this snippet also shows how to make (and later remove) a temporary file, and how to use the writestr method to add a member to a zip file without placing that member into a disk file first.

Note that the path to the zip file from which you import is treated somewhat like a directory. (In this specific example run, that path is /tmp/tmpESVzeY.zip, but of course, since we’re dealing with a temporary file, the exact value of the path can change at each run, depending also on your platform.) In particular, the __file__ global variable, within the module hello, which is imported from the zip file, has a value of /tmp/tmpESVzeY.zip/hello.py—a pseudo-path, made up of the zip file’s path seen as a “directory” followed by the relative path of hello.py within the zip file. If you import from a zip file a module that computes paths relative to itself in order to get to data files, you need to adapt the module to this effect, because you cannot just open such a “pseudo-path” to get a file object: rather, to read or write files inside a zip file, you must use functions from standard library module zipfile, as shown in the solution.

For more information about importing modules from a zip file, see Recipe 16.12. While that recipe is Unix-specific, the information in the recipe’s Discussion about importing from zip files is also valid for Windows.

See Also

Documentation for the zipfile module in the Library Reference and Python in a Nutshell; modules tempfile, os, sys; for archiving a tree of files, see Recipe 2.11; for more information about importing modules from a zip file, Recipe 16.12.

2.10. Handling a zip File Inside a String

Credit: Indyana Jones

Problem

Your program receives a zip file as a string of bytes in memory, and you need to read the information in this zip file.

Solution

Solving this kind of problem is exactly what standard library module cStringIO is for:

import cStringIO, zipfile
class ZipString(zipfile.ZipFile):
    def __init__(self, datastring):
        zipfile.ZipFile.__init__(self, cStringIO.StringIO(datastring))

Discussion

I often find myself faced with this task—for example, zip files coming from BLOB fields in a database or ones received from a network connection. I used to save such binary data to a temporary file, then open the file with the standard library module zipfile. Of course, I had to ensure I deleted the temporary file when I was done. Then I thought of using the standard library module cStringIO for the purpose . . . and never looked back.

Module cStringIO lets you wrap a string of bytes so it can be accessed as a file object. You can also do things the other way around, writing into a cStringIO.StringIO instance as if it were a file object, and eventually recovering its contents as a string of bytes. Most Python modules that take file objects don’t check whether you’re passing an actual file—rather, any file-like object will do; the module’s code just calls on the object whatever file methods it needs. As long as the object supplies those methods and responds correctly when they’re called, everything just works. This demonstrates the awesome power of signature-based polymorphism and hopefully teaches why you should almost never type-test (utter such horrors as if type(x) is y, or even just the lesser horror if isinstance(x, y)) in your own code! A few low-level modules, such as marshal, are unfortunately adamant about using “true” files, but zipfile isn’t, and this recipe shows how simple it makes your life!
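
For instance, a tiny sketch of the “other way around”: writing into a cStringIO.StringIO instance as if it were a file, then recovering the accumulated bytes at the end:

import cStringIO
sio = cStringIO.StringIO( )
sio.write('some bytes ')
sio.write('accumulated a piece at a time')
data = sio.getvalue( )    # the whole contents back as one string of bytes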

If you are using a version of Python that is different from the mainstream C-coded one, known as “CPython”, you may not find module cStringIO in the standard library. The leading c in the name of the module indicates that it’s a C-specific module, optimized for speed but not guaranteed to be in the standard library for other compliant Python implementations. Several such alternative implementations include both production-quality ones (such as Jython, which is coded in Java and runs on a JVM) and experimental ones (such as pypy, which is coded in Python and generates machine code, and IronPython, which is coded in C# and runs on Microsoft’s .NET CLR). Not to worry: the Python Standard Library always includes module StringIO, which is coded in pure Python (and thus is usable from any compliant implementation of Python), and implements the same functionality as module cStringIO (albeit not quite as fast, at least on the mainstream CPython implementation). You just need to alter your import statement a bit to make sure you get cStringIO when available and StringIO otherwise. For example, this recipe might become:

import zipfile
try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO
class ZipString(zipfile.ZipFile):
    def __init__(self, datastring):
        zipfile.ZipFile.__init__(self, StringIO(datastring))

With this modification, the recipe becomes useful in Jython, and other, alternative implementations.

See Also

Modules zipfile and cStringIO in the Library Reference and Python in a Nutshell; Jython is at http://www.jython.org/; pypy is at http://codespeak.net/pypy/; IronPython is at http://ironpython.com/.

2.11. Archiving a Tree of Files into a Compressed tar File

Credit: Ed Gordon, Ravi Teja Bhupatiraju

Problem

You need to archive all of the files and folders in a subtree into a tar archive file, compressing the data with either the popular gzip approach or the higher-compressing bzip2 approach.

Solution

The Python Standard Library’s tarfile module directly supports either kind of compression: you just need to specify the kind of compression you require, as part of the option string that you pass when you call tarfile.TarFile.open to create the archive file. For example:

import tarfile, os
def make_tar(folder_to_backup, dest_folder, compression='bz2'):
    if compression:
        dest_ext = '.' + compression
    else:
        dest_ext = ''
    arcname = os.path.basename(folder_to_backup)
    dest_name = '%s.tar%s' % (arcname, dest_ext)
    dest_path = os.path.join(dest_folder, dest_name)
    if compression:
        dest_cmp = ':' + compression
    else:
        dest_cmp = ''
    out = tarfile.TarFile.open(dest_path, 'w'+dest_cmp)
    out.add(folder_to_backup, arcname)
    out.close( )
    return dest_path

Discussion

You can pass, as argument compression to function make_tar, the string 'gz' to get gzip compression instead of the default bzip2, or you can pass the empty string '' to get no compression at all. Besides making the file extension of the result either .tar, .tar.gz, or .tar.bz2, as appropriate, your choice for the compression argument determines which string is passed as the second argument to tarfile.TarFile.open: 'w' when you want no compression, 'w:gz' for gzip, or 'w:bz2' for bzip2.

Class tarfile.TarFile offers several other classmethods, besides open, which you could use to generate a suitable instance. I find open handier and more flexible because it takes the compression information as part of the mode string argument. However, if you want to ensure bzip2 compression is used unconditionally, for example, you could choose to call classmethod bz2open instead.

Once we have an instance of class tarfile.TarFile that is set to use the kind of compression we desire, the instance’s method add does all we require. In particular, when string folder_to_backup names a “directory” (or folder), rather than an ordinary file, add recursively adds all of the subtree rooted in that directory. If on some other occasion, we wanted to change this behavior to get precise control on what is archived, we could pass to add an additional named argument recursive=False to switch off this implicit recursion. After calling add, all that’s left for function make_tar to do is to close the TarFile instance and return the path on which the tar file has been written, just in case the caller needs this information.
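
For instance, a usage sketch of make_tar (the paths are, of course, made up):

archive_bz2 = make_tar('/home/me/project', '/tmp')                    # bzip2, the default
archive_gz = make_tar('/home/me/project', '/tmp', compression='gz')   # gzip instead
plain_tar = make_tar('/home/me/project', '/tmp', compression='')      # no compression
print archive_bz2     # e.g., /tmp/project.tar.bz2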

See Also

Library Reference docs on module tarfile.

2.12. Sending Binary Data to Standard Output Under Windows

Credit: Hamish Lawson

Problem

You want to send binary data (e.g., an image) to stdout under Windows.

Solution

That’s what the setmode function, in the platform-dependent (Windows-only) msvcrt module in the Python Standard Library, is for:

import sys
if sys.platform == "win32":
    import os, msvcrt
    msvcrt.setmode(sys.stdout.fileno( ), os.O_BINARY)

You can now call sys.stdout.write with any bytestring as the argument, and the bytestring will go unmodified to standard output.

Discussion

While Unix doesn’t make (or need) a distinction between text and binary modes, if you are reading or writing binary data, such as an image, under Windows, the file must be opened in binary mode. This is a problem for programs that write binary data to standard output (as a CGI script, for example, could be expected to do), because Python opens the sys.stdout file object on your behalf, normally in text mode.

You can have stdout opened in binary mode instead by supplying the -u command-line option to the Python interpreter. For example, if you know your CGI script will be running under the Apache web server, as the first line of your script, you can use something like:

#! c:/python23/python.exe -u

assuming you’re running under Python 2.3 with a standard installation. Unfortunately, you may not always be able to control the command line under which your script will be started. The approach taken in this recipe’s “Solution” offers a workable alternative. The setmode function provided by the Windows-specific msvcrt module lets you change the mode of stdout’s underlying file descriptor. By using this function, you can ensure from within your program that sys.stdout gets set to binary mode.
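
Putting it together, a small sketch of a script that sends an image file (the filename is a placeholder) to standard output in binary mode:

import sys, os
if sys.platform == "win32":
    import msvcrt
    msvcrt.setmode(sys.stdout.fileno( ), os.O_BINARY)
data = open('picture.gif', 'rb').read( )     # hypothetical image file
sys.stdout.write(data)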

See Also

Documentation for the msvcrt module in the Library Reference and Python in a Nutshell.

2.13. Using a C++-like iostream Syntax

Credit: Erik Max Francis

Problem

You like the C++ approach to I/O, based on ostreams and manipulators (special objects that cause special effects on a stream when inserted in it) and want to use it in your Python programs.

Solution

Python lets you overload operators by having your classes define special methods (i.e., methods whose names start and end with two underscores). To use << for output, as you do in C++, you just need to code an output stream class that defines the special method _ _lshift_ _:

class IOManipulator(object):
    def __init__(self, function=None):
        self.function = function
    def do(self, output):
        self.function(output)
def do_endl(stream):
    stream.output.write('\n')
    stream.output.flush( )
endl = IOManipulator(do_endl)
class OStream(object):
    def __init__(self, output=None):
        if output is None:
            import sys
            output = sys.stdout
        self.output = output
        self.format = '%s'
    def __lshift__(self, thing):
        ''' the special method which Python calls when you use the <<
            operator and the left-hand operand is an OStream '''
        if isinstance(thing, IOManipulator):
            thing.do(self)
        else:
            self.output.write(self.format % thing)
            self.format = '%s'
        return self
def example_main( ):
    cout = OStream( )
    cout<< "The average of " << 1 << " and " << 3 << " is " << (1+3)/2 <<endl
# emits The average of 1 and 3 is 4
if __name__ == '__main__':
    example_main( )

Discussion

Wrapping Python file-like objects to emulate C++ ostreams syntax is quite easy. This recipe shows how to code the insertion operator << for this purpose. The recipe also implements an IOManipulator class (as in C++) to call arbitrary functions on a stream upon insertion, and a predefined manipulator endl (guess where that name comes from) to write a newline and flush the stream.

The reason class OStream’s instances hold a format attribute and reset it to the default value '%s' after each self.output.write is so that you can build devious manipulators that temporarily save formatting state on the stream object, such as:

def do_hex(stream):
    stream.format = '%x'
hex = IOManipulator(do_hex)
cout << 23 << ' in hex is ' << hex << 23 << ', and in decimal ' << 23 << endl
# emits 23 in hex is 17, and in decimal 23

Some people detest C++’s cout << something syntax, some love it. In cases such as the example given in the recipe, this syntax ends up simpler and more readable than:

print>>somewhere, "The average of %d and %d is %f\n" % (1, 3, (1+3)/2)

which is the “Python-native” alternative (looking a lot like C in this case). It depends in part on whether you’re more used to C++ or to C. In any case, this recipe gives you a choice! Even if you don’t end up using this particular approach, it’s still interesting to see how simple operator overloading is in Python.

See Also

Library Reference and Python in a Nutshell docs on file objects and special methods such as __lshift__; Recipe 4.20 implements a Python version of C’s printf function.

2.14. Rewinding an Input File to the Beginning

Credit: Andrew Dalke

Problem

You need to make an input file object (with data coming from a socket or other input file handle) rewindable back to the beginning so you can read it over.

Solution

Wrap the file object into a suitable class:

from cStringIO import StringIO
class RewindableFile(object):
    """ Wrap a file handle to allow seeks back to the beginning. """
    def __init__(self, input_file):
        """ Wraps input_file into a file-like object with rewind. """
        self.file = input_file
        self.buffer_file = StringIO()
        self.at_start = True
        try:
            self.start = input_file.tell()
        except (IOError, AttributeError):
            self.start = 0
        self._use_buffer = True
    def seek(self, offset, whence=0):
        """ Seek to a given byte position.
        Must be: whence == 0 and offset == self.start
        """
        if whence != 0:
            raise ValueError("whence=%r; expecting 0" % (whence,))
        if offset != self.start:
            raise ValueError("offset=%r; expecting %s" % (offset, self.start))
        self.rewind()
    def rewind(self):
        """ Simplified way to seek back to the beginning. """
        self.buffer_file.seek(0)
        self.at_start = True
    def tell(self):
        """ Return the current position of the file (must be at start). """
        if not self.at_start:
            raise TypeError("RewindableFile can't tell except at start of file")
        return self.start
    def _read(self, size):
        if size < 0:             # read all the way to the end of the file
            x = self.buffer_file.read()    # whatever is still unread in the buffer
            y = self.file.read()           # plus everything left in the real file
            if self._use_buffer:
                self.buffer_file.write(y)
            return x + y
        elif size == 0:          # no need to actually read the empty string
            return ""
        x = self.buffer_file.read(size)
        if len(x) < size:
            y = self.file.read(size - len(x))
            if self._use_buffer:
                self.buffer_file.write(y)
            return x + y
        return x
    def read(self, size=-1):
        """ Read up to 'size' bytes from the file.
        Default is -1, which means to read to end of file.
        """
        x = self._read(size)
        if self.at_start and x:
            self.at_start = False
        self._check_no_buffer()
        return x
    def readline(self):
        """ Read a line from the file. """
        # Can we get it out of the buffer_file?
        s = self.buffer_file.readline()
        if s[-1:] == "\n":
            return s
        # No, so read a line from the input file
        t = self.file.readline()
        if self._use_buffer:
            self.buffer_file.write(t)
        self._check_no_buffer()
        return s + t
    def readlines(self):
        """read all remaining lines from the file"""
        return self.read().splitlines(True)
    def _check_no_buffer(self):
        # If 'nobuffer' has been called and we're finished with the buffer file,
        # get rid of the buffer, redirect everything to the original input file.
        if not self._use_buffer and \
               self.buffer_file.tell() == len(self.buffer_file.getvalue()):
            # for top performance, we rebind all relevant methods in self
            for n in 'seek tell read readline readlines'.split():
                setattr(self, n, getattr(self.file, n, None))
            del self.buffer_file
    def nobuffer(self):
        """tell RewindableFile to stop using the buffer once it's exhausted"""
        self._use_buffer = False

Discussion

Sometimes, data coming from a socket or other input file handle isn’t what it was supposed to be. For example, suppose you are reading from a buggy server, which is supposed to return an XML stream, but sometimes returns an unformatted error message instead. (This scenario often occurs because many servers don’t handle incorrect input very well.)

This recipe’s RewindableFile class helps you solve this problem. r = RewindableFile(f) wraps the original input stream f into a “rewindable file” instance r which essentially mimics f’s behavior but also provides a buffer. Read requests to r are forwarded to f, and the data thus read gets appended to a buffer, then returned to the caller. The buffer contains all the data read so far.

r can be told to rewind, meaning to seek back to the start position. The next read request will come from the buffer, until the buffer has been read, in which case it gets the data from the input stream again. The newly read data is also appended to the buffer.

When buffering is no longer needed, call the nobuffer method of r. This tells r that, once it’s done reading the buffer’s current contents, it can throw the buffer away. After nobuffer is called, the behavior of seek is no longer defined.

For example, suppose you have a server that gives either an error message of the form ERROR: cannot do that, or an XML data stream, starting with '<?xml'...:

    import urllib2, RewindableFile
    infile = urllib2.urlopen("http://somewhere/")
    infile = RewindableFile.RewindableFile(infile)
    s = infile.readline()
    if s.startswith("ERROR:"):
        raise Exception(s[:-1])
    infile.seek(0)
    infile.nobuffer()   # Don't buffer the data any more
    ... process the XML from infile ...

One sometimes-useful Python idiom is not supported by the class in this recipe: you can’t reliably stash away the bound methods of a RewindableFile instance. (If you don’t know what bound methods are, no problem, of course, since in that case you surely won’t want to stash them anywhere!). The reason for this limitation is that, when the buffer is empty, the RewindableFile code reassigns the input file’s read, readlines, etc., methods, as instance variables of self. This gives slightly better performance, at the cost of not supporting the infrequently-used idiom of saving bound methods. See Recipe 6.11 for another example of a similar technique, where an instance irreversibly changes its own methods.

The tell method, which gives the current location of a file, can be called on an instance of RewindableFile only right after wrapping, and before any reading, to get the beginning byte location. The RewindableFile implementation of tell tries to get the real position from the wrapped file, and use that as the beginning location. If the wrapped file does not support tell, then the RewindableFile implementation of tell just returns 0.

See Also

Site http://www.dalkescientific.com/Python/ for the latest version of this recipe’s code; Library Reference and Python in a Nutshell docs on file objects and module cStringIO; Recipe 6.11 for another example of an instance affecting an irreversible behavior change on itself by rebinding its methods.

2.15. Adapting a File-like Object to a True File Object

Credit: Michael Kent

Problem

You need to pass a file-like object (e.g., the results of a call such as urllib.urlopen) to a function or method that insists on receiving a true file object (e.g., a function such as marshal.load).

Solution

To cooperate with such type-checking, we need to write all data from the file-like object into a temporary file on disk. Then, we can use the (true) file object for that temporary disk file. Here’s a function that implements this idea:

import tempfile
CHUNK_SIZE = 16 * 1024
def adapt_file(fileObj):
    if isinstance(fileObj, file): return fileObj
    tmpFileObj = tempfile.TemporaryFile()
    while True:
        data = fileObj.read(CHUNK_SIZE)
        if not data: break
        tmpFileObj.write(data)
    fileObj.close()
    tmpFileObj.seek(0)
    return tmpFileObj
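
For instance, here is a hedged usage sketch (the URL, and the assumption that it serves marshal-format data, are purely illustrative):

import urllib, marshal
netfile = urllib.urlopen('http://example.com/data.marshal')   # file-like, but not a true file
realfile = adapt_file(netfile)        # a true file object, backed by a temporary disk file
data = marshal.load(realfile)         # marshal.load insists on a true file
realfile.close()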

Discussion

This recipe demonstrates an unusual Pythonic application of the Adapter Design Pattern (i.e., what to do when you have an X and you need a Y instead). While design patterns are most often thought of in an object-oriented way, and therefore implemented by writing classes, nothing about that is intrinsically necessary. In this case, for example, we don’t really need to introduce any new class, since the adapt_file function is obviously sufficient. Therefore, we respect Occam’s Razor and do not introduce entities without necessity.

One way or another, you should think in terms of adaptation, in preference to type testing, even when you need to rely on some lower-level utility that insists on precise types. Instead of raising an exception when you get passed an object that’s perfectly adequate save for the technicality of type membership, think of the possibility of adapting what you get passed to what you need. In this way, your code will be more flexible and more suitable for reuse.

See Also

Documentation on built-in file objects, and modules tempfile and marshal, in the Library Reference and Python in a Nutshell.

2.16. Walking Directory Trees

Credit: Robin Parmar, Alex Martelli

Problem

You need to examine a “directory”, or an entire directory tree rooted in a certain directory, and iterate on the files (and optionally folders) that match certain patterns.

Solution

The generator os.walk from the Python Standard Library module os is sufficient for this task, but we can dress it up a bit by coding our own function to wrap os.walk:

import os, fnmatch
def all_files(root, patterns='*', single_level=False, yield_folders=False):
    # Expand patterns from semicolon-separated string to list
    patterns = patterns.split(';')
    for path, subdirs, files in os.walk(root):
        if yield_folders:
            files.extend(subdirs)
        files.sort( )
        for name in files:
            for pattern in patterns:
                if fnmatch.fnmatch(name, pattern):
                    yield os.path.join(path, name)
                    break
        if single_level:
            break

Discussion

The standard directory tree traversal generator os.walk is powerful, simple, and flexible. However, as it stands, os.walk lacks a few niceties that applications may need, such as selecting files according to some patterns, flat (linear) looping on all files (and optionally folders) in sorted order, and the ability to examine a single directory (without entering its subdirectories). This recipe shows how easily these kinds of features can be added, by wrapping os.walk into another simple generator and using standard library module fnmatch to check filenames for matches to patterns.

The file patterns are possibly case-insensitive (that’s platform-dependent) but otherwise Unix-style, as supplied by the standard fnmatch module, which this recipe uses. To specify multiple patterns, join them with a semicolon. Note that this means that semicolons themselves can’t be part of a pattern.

For example, you can easily get a list of all Python and HTML files in directory /tmp or any subdirectory thereof:

thefiles = list(all_files('/tmp', '*.py;*.htm;*.html'))

Should you just want to process these files’ paths one at a time (e.g., print them, one per line), you do not need to build a list: you can simply loop on the result of calling all_files:

for path in all_files('/tmp', '*.py;*.htm;*.html'):
    print path

If your platform is case-sensitive, and you want case-sensitive matching, then you need to specify the patterns more laboriously, e.g., '*.[Hh][Tt][Mm][Ll]' instead of just '*.html'.

See Also

Documentation for the os.path module and the os.walk generator, as well as the fnmatch module, in the Library Reference and Python in a Nutshell.

2.17. Swapping One File Extension for Another Throughout a Directory Tree

Credit: Julius Welby

Problem

You need to rename files throughout a subtree of directories, specifically changing the names of all files with a given extension so that they have a different extension instead.

Solution

Operating on all files of a whole subtree of directories is easy enough with the os.walk function from Python’s standard library:

import os
def swapextensions(dir, before, after):
    if before[:1] != '.':
        before = '.'+before
    thelen = -len(before)
    if after[:1] != '.':
        after = '.'+after
    for path, subdirs, files in os.walk(dir):
        for oldfile in files:
            if oldfile[thelen:] == before:
                oldfile = os.path.join(path, oldfile)
                newfile = oldfile[:thelen] + after
                os.rename(oldfile, newfile)
if __name__ == '__main__':
    import sys
    if len(sys.argv) != 4:
        print "Usage: swapext rootdir before after"
        sys.exit(100)
    swapextensions(sys.argv[1], sys.argv[2], sys.argv[3])

Discussion

This recipe shows how to change the file extensions of all files in a specified directory, all of its subdirectories, all of their subdirectories, and so on. This technique is useful for changing the extensions of a whole batch of files in a folder structure, such as a web site. You can also use it to correct errors made when saving a batch of files programmatically.

The recipe is usable either as a module to be imported from any other, or as a script to run from the command line, and it is carefully coded to be platform-independent. You can pass in the extensions either with or without the leading dot (.), since the code in this recipe inserts that dot, if necessary. (As a consequence of this convenience, however, this recipe is unable to deal with files completely lacking any extension, including the dot; this limitation may be bothersome on Unix systems.)

The implementation of this recipe uses techniques that purists might consider too low level—specifically, it deals with filenames and extensions mostly by direct string manipulation, rather than through the functions in module os.path. It’s not a big deal: using os.path is fine, but using Python’s powerful string facilities to deal with filenames is fine, too; a variant built on os.path.splitext is sketched below.
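
Here is roughly what such a variant might look like (an illustrative sketch, not the recipe author’s code; the function name is made up):

import os
def swapextensions_via_ospath(dir, before, after):
    # normalize both extensions so they always carry the leading dot
    before = '.' + before.lstrip('.')
    after = '.' + after.lstrip('.')
    for path, subdirs, files in os.walk(dir):
        for oldname in files:
            root, ext = os.path.splitext(oldname)
            if ext == before:
                oldfile = os.path.join(path, oldname)
                os.rename(oldfile, os.path.join(path, root + after))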

2.18. Finding a File Given a Search Path

Credit: Chui Tey

Problem

Given a search path (a string of directories with a separator in between), you need to find the first file along the path with the requested name.

Solution

Basically, you need to loop over the directories in the given search path:

import os
def search_file(filename, search_path, pathsep=os.pathsep):
    """ Given a search path, find file with requested name """
    for path in search_path.split(pathsep):
        candidate = os.path.join(path, filename)
        if os.path.isfile(candidate):
            return os.path.abspath(candidate)
    return None
if __name__ == '__main__':
    search_path = '/bin' + os.pathsep + '/usr/bin'  # ; on Windows, : on Unix
    find_file = search_file('ls', search_path)
    if find_file:
        print "File 'ls' found at %s" % find_file
    else:
        print "File 'ls' not found"

Discussion

This recipe’s “Problem” is a reasonably frequent task, and Python makes resolving it extremely easy. Other recipes perform similar and related tasks: to find files specifically on Python’s own search path, see Recipe 2.20; to find all files matching a pattern along a search path, see Recipe 2.19.

The search loop can be coded in many ways, but returning the path (made into an absolute path, for uniformity and convenience) as soon as a hit is found is simplest as well as fast. The explicit return None after the loop is not strictly needed, since None is returned by Python when a function falls off the end. Having the return statement explicitly there in this case makes the functionality of search_file much clearer at first sight.

See Also

Recipe 2.20; Recipe 2.19; documentation for the module os in the Library Reference and Python in a Nutshell.

2.19. Finding Files Given a Search Path and a Pattern

Credit: Bill McNeill, Andrew Kirkpatrick

Problem

Given a search path (i.e., a string of directories with a separator in between), you need to find all files along the path whose names match a given pattern.

Solution

Basically, you need to loop over the directories in the given search path. The loop is best encapsulated in a generator:

import glob, os
def all_files(pattern, search_path, pathsep=os.pathsep):
    """ Given a search path, yield all files matching the pattern. """
    for path in search_path.split(pathsep):
        for match in glob.glob(os.path.join(path, pattern)):
            yield match

Discussion

One nice thing about generators is that you can easily use them to obtain just the first item, all items, or anything in between. For example, to print the first file matching '*.pye' along your environment’s PATH:

print all_files('*.pye', os.environ['PATH']).next( )

To print all such files, one per line:

for match in all_files('*.pye', os.environ['PATH']):
    print match

To print them all at once, as a list:

print list(all_files('*.pye', os.environ['PATH']))

I have also wrapped around this all_files function a main script to show all of the files with a given name along my PATH. Thus I can see not only which one will execute for that name (the first one), but also which ones are “shadowed” by that first one:

if __name__ == '__main__':
    import sys
    if len(sys.argv) != 2 or sys.argv[1].startswith('-'):
        print 'Use: %s <pattern>' % sys.argv[0]
        sys.exit(1)
    matches = list(all_files(sys.argv[1], os.environ['PATH']))
    print '%d matches:' % len(matches)
    for match in matches:
        print match

See Also

Recipe 2.18 for a simpler approach to find the first file with a specified name along the path; Library Reference and Python in a Nutshell docs for modules os and glob.

2.20. Finding a File on the Python Search Path

Credit: Mitch Chapman

Problem

A large Python application includes resource files (e.g., Glade project files, SQL templates, and images) as well as Python packages. You want to store these associated files together with the Python packages that use them.

Solution

You need to be able to look for either files or directories along Python’s sys.path:

import sys, os
class Error(Exception): pass
def _find(pathname, matchFunc=os.path.isfile):
    for dirname in sys.path:
        candidate = os.path.join(dirname, pathname)
        if matchFunc(candidate):
            return candidate
    raise Error("Can't find file %s" % pathname)
def findFile(pathname):
    return _find(pathname)
def findDir(path):
    return _find(path, matchFunc=os.path.isdir)

Discussion

Larger Python applications consist of sets of Python packages and associated sets of resource files. It’s convenient to store these associated files together with the Python packages that use them, and it’s easy to do so if you use this variation on the previous Recipe 2.18 to find files or directories with pathnames relative to the Python search path.
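
For instance, a package that ships a Glade file and a directory of SQL templates might locate them like this (a purely hypothetical usage sketch; the package and resource names are made up):

import os
glade_file = findFile(os.path.join('myapp', 'resources', 'main.glade'))
sql_dir = findDir(os.path.join('myapp', 'sql_templates'))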

See Also

Recipe 2.18; documentation for the os module in the Library Reference and Python in a Nutshell.

2.21. Dynamically Changing the PythonSearch Path

Credit: Robin Parmar

Problem

Modules must be on the Python search path before they can be imported, but you don’t want to set a huge permanent path because that slows performance—so, you want to change the path dynamically.

Solution

We simply conditionally add a “directory” to Python’s sys.path, carefully checking to avoid duplication:

def AddSysPath(new_path):
    """ AddSysPath(new_path): adds a "directory" to Python's sys.path
    Does not add the directory if it does not exist or if it's already on
    sys.path. Returns 1 if OK, -1 if new_path does not exist, 0 if it was
    already on sys.path.
    """
    import sys, os
    # Avoid adding nonexistent paths
    if not os.path.exists(new_path): return -1
    # Standardize the path.  Windows is case-insensitive, so lowercase
    # for definiteness if we are on Windows.
    new_path = os.path.abspath(new_path)
    if sys.platform == 'win32':
        new_path = new_path.lower( )
    # Check against all currently available paths
    for x in sys.path:
        x = os.path.abspath(x)
        if sys.platform == 'win32':
            x = x.lower()
        if new_path in (x, x + os.sep):
            return 0
    sys.path.append(new_path)
    # if you want the new_path to take precedence over existing
    # directories already in sys.path, instead of appending, use:
    # sys.path.insert(0, new_path)
    return 1
if __name__ == '__main__':
    # Test and show usage
    import sys
    print 'Before:'
    for x in sys.path: print x
    if sys.platform == 'win32':
          print AddSysPath('c:\\Temp')
          print AddSysPath('c:\\temp')
    else:
          print AddSysPath('/usr/lib/my_modules')
    print 'After:'
    for x in sys.path: print x

Discussion

Modules must be in directories that are on the Python search path before they can be imported, but we don’t want to have a huge permanent path because doing so slows down every import performed by every Python script and application. This simple recipe dynamically adds a “directory” to the path, but only if that directory exists and was not already on sys.path.

sys.path is a list, so it’s easy to add directories to its end, using sys.path.append. Every import performed after such an append will automatically look in the newly added directory if it cannot be satisfied from earlier ones. As indicated in the Solution, you can alternatively use sys.path.insert(0, . . . so that the newly added directory is searched before ones that were already in sys.path.

It’s no big deal if sys.path ends up with some duplicates or if a nonexistent directory is accidentally appended to it; Python’s import statement is clever enough to shield itself against such issues. However, each time such a problem occurs at import time (e.g., from duplicate unsuccessful searches, errors from the operating system that need to be handled gracefully, etc.), a small price is paid in terms of performance. To avoid uselessly paying such a price, this recipe does a conditional addition to sys.path, never appending any directory that doesn’t exist or is already in sys.path. Directories appended by this recipe stay in sys.path only for the duration of this program’s run, just like any other dynamic alteration you might do to sys.path.

See Also

Documentation for the sys and os.path modules in the Library Reference and Python in a Nutshell.

2.22. Computing the Relative Path from One Directory to Another

Credit: Cimarron Taylor, Alan Ezust

Problem

You need to know the relative path from one directory to another—for example, to create a symbolic link or a relative reference in a URL.

Solution

The simplest approach is to split paths into lists of directories, then work on the lists. Using a couple of auxiliary and somewhat generic helper functions, we could code:

import os, itertools
def all_equal(elements):
    ''' return True if all the elements are equal, otherwise False. '''
    first_element = elements[0]
    for other_element in elements[1:]:
        if other_element != first_element: return False
    return True
def common_prefix(*sequences):
    ''' return a list of common elements at the start of all sequences,
        then a list of lists that are the unique tails of each sequence. '''
    # if there are no sequences at all, we're done
    if not sequences: return [], []
    # loop in parallel on the sequences
    common = []
    for elements in itertools.izip(*sequences):
        # unless all elements are equal, bail out of the loop
        if not all_equal(elements): break
        # got one more common element, append it and keep looping
        common.append(elements[0])
    # return the common prefix and unique tails
    return common, [ sequence[len(common):] for sequence in sequences ]
def relpath(p1, p2, sep=os.path.sep, pardir=os.path.pardir):
    ''' return a relative path from p1 equivalent to path p2.
        In particular: the empty string, if p1 == p2;
                       p2, if p1 and p2 have no common prefix.
    '''
    common, (u1, u2) = common_prefix(p1.split(sep), p2.split(sep))
    if not common:
        return p2      # leave path absolute if nothing at all in common
    return sep.join( [pardir]*len(u1) + u2 )
def test(p1, p2, sep=os.path.sep):
    ''' call function relpath and display arguments and results. '''
    print "from", p1, "to", p2, " -> ", relpath(p1, p2, sep)
if __name__ == '__main__':
    test('/a/b/c/d', '/a/b/c1/d1', '/')
    test('/a/b/c/d', '/a/b/c/d', '/')
    test('c:/x/y/z', 'd:/x/y/z', '/')

Discussion

The workhorse in this recipe is the simple but very general function common_prefix, which, given any N sequences, returns their common prefix and a list of their respective unique tails. To compute the relative path between two given paths, we can ignore their common prefix. We need only the appropriate number of move-up markers (normally, os.path.pardir—e.g., ../ on Unix-like systems; we need as many of them as the length of the unique tail of the starting path) followed by the unique tail of the destination path. So, function relpath splits the paths into lists of directories, calls common_prefix, and then performs exactly the construction just described.

common_prefix centers on the loop for elements in itertools.izip(*sequences), relying on the fact that izip ends with the shortest of the iterables it’s zipping. The body of the loop only needs to prematurely terminate the loop as soon as it meets a tuple of elements (coming one from each sequence, per izip’s specifications) that aren’t all equal, and to keep track of the elements that are equal by appending one of them to list common. Once the loop is done, all that’s left to prepare the lists to return is to slice off the elements that are already in common from the front of each of the sequences.
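
Just to make that construction concrete, the three test calls at the end of the recipe emit:

from /a/b/c/d to /a/b/c1/d1  ->  ../../c1/d1
from /a/b/c/d to /a/b/c/d  ->
from c:/x/y/z to d:/x/y/z  ->  d:/x/y/z

In the second case the two paths coincide, so the relative path is the empty string; in the third, they share no common prefix, so p2 is returned unchanged.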

Function all_equal could alternatively be implemented in a completely different way, less simple and obvious, but interesting:

def all_equal(elements):
    return len(dict.fromkeys(elements)) == 1

or, equivalently and more concisely, in Python 2.4 only,

def all_equal(elements):
    return len(set(elements)) == 1

Saying that all elements are equal is exactly the same as saying that the set of the elements has cardinality (length) one. In the variation using dict.fromkeys, we use a dict to represent the set, so that variation works in Python 2.3 as well as in 2.4. The variation using set is clearer, but it only works in Python 2.4. (You could also make it work in version 2.3, as well as Python 2.4, by using the standard Python library module sets).

See Also

Library Reference and Python in a Nutshell docs for modules os and itertools.

2.23. Reading an Unbuffered Character in a Cross-Platform Way

Credit: Danny Yoo

Problem

Your application needs to read single characters, unbuffered, from standard input, and it needs to work on both Windows and Unix-like systems.

Solution

When we need a cross-platform solution, starting with platform-dependent ones, we need to wrap the different solutions so that they look the same:

try:
    from msvcrt import getch
except ImportError:
    ''' we're not on Windows, so we try the Unix-like approach '''
    def getch( ):
        import sys, tty, termios
        fd = sys.stdin.fileno( )
        old_settings = termios.tcgetattr(fd)
        try:
            tty.setraw(fd)
            ch = sys.stdin.read(1)
        finally:
            termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
        return ch

Discussion

On Windows, the standard Python library module msvcrt offers the handy getch function to read one character, unbuffered, from the keyboard, without echoing it to the screen. However, this module is not part of the standard Python library on Unix and Unix-like platforms, such as Linux and Mac OS X. On such platforms, we can get the same functionality with the tty and termios modules of the standard Python library (which, in turn, are not present on Windows).

The key point is that in application-level code, we should never have to worry about such issues; rather, we should write our application code in platform-independent ways, counting on library functions to paper over the differences between platforms. The Python Standard Library fulfills that role admirably for most tasks, but not all, and the problem posed by this recipe is an example of one for which the Python Standard Library doesn’t directly supply a cross-platform solution.

When we can’t find a ready-packaged cross-platform solution in the standard library, we should package it anyway as part of our own additional custom library. This recipe’s Solution, besides solving the specific task of the recipe, also shows one good general way to go about such packaging. (Alternatively, you can test sys.platform, but I prefer the approach shown in this recipe.)

Your own library module should try to import the standard library module it needs on a certain platform within a try clause and include a corresponding except ImportError clause that is triggered when the module is running on a different platform. In the body of that except clause, your own library module can apply whatever alternate approach will work on the different platform. In some rare cases, you may need more than two platform-dependent approaches, but most often you’ll need one approach on Windows and only one other approach to cover all other platforms. This is because most non-Windows platforms today are generally Unix or Unix-like.
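
For instance, once this wrapper is in place (say, in a module of your own), application code stays entirely platform-neutral (a trivial usage sketch):

print 'Press any key to continue...'
ch = getch()
print 'you pressed %r' % ch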

See Also

Library Reference and Python in a Nutshell docs for msvcrt, tty, and termios.

2.24. Counting Pages of PDF Documents on Mac OS X

Credit: Dinu Gherman, Dan Wolfe

Problem

You’re running on a reasonably recent version of Mac OS X (version 10.3 “Panther” or later), and you need to know the number of pages in a PDF document.

Solution

The PDF format and Python are both natively integrated with Mac OS X (10.3 or later), and this allows a rather simple solution:

#!/usr/bin/python
import CoreGraphics
def pageCount(pdfPath):
    "Return the number of pages for the PDF document at the given path."
    pdf = CoreGraphics.CGPDFDocumentCreateWithProvider(
              CoreGraphics.CGDataProviderCreateWithFilename(pdfPath)
          )
    return pdf.getNumberOfPages( )
if __name__ == '__main__':
    import sys
    for path in sys.argv[1:]:
        print pageCount(path)

Discussion

A reasonable alternative to this recipe might be to use the PyObjC Python extension, which (among other wonders) lets Python code reuse all the power in the Foundation and AppKit frameworks that come with Mac OS X. Such a choice would let you write a Python script that is also able to run on older versions of Mac OS X, such as 10.2 Jaguar. However, relying on Mac OS X 10.3 or later ensures we can use the Python installation that is integrated as a part of the operating system, as well as such goodies as the CoreGraphics Python extension module (also part of Mac OS X “Panther”) that lets your Python code reuse Apple’s excellent Quartz graphics engine directly.

See Also

PyObjC is at http://pyobjc.sourceforge.net/; information on the CoreGraphics module is at http://www.macdevcenter.com/pub/a/mac/2004/03/19/core_graphics.html.

2.25. Changing File Attributes on Windows

Credit: John Nielsen

Problem

You need to set the attributes of a file on Windows; for example, you may need to set the file as read-only, archived, and so on.

Solution

PyWin32’s win32api module offers a function SetFileAttributes that makes this task quite simple:

import win32con, win32api, os
# create a file, just to show how to manipulate it
thefile = 'test'
f = open('test', 'w')
f.close( )
# to make the file hidden...:
win32api.SetFileAttributes(thefile, win32con.FILE_ATTRIBUTE_HIDDEN)
# to make the file readonly:
win32api.SetFileAttributes(thefile, win32con.FILE_ATTRIBUTE_READONLY)
# to be able to delete the file we need to set it back to normal:
win32api.SetFileAttributes(thefile, win32con.FILE_ATTRIBUTE_NORMAL)
# and finally we remove the file we just made
os.remove(thefile)

Discussion

One interesting use of win32api.SetFileAttributes is to enable a file’s removal. Removing a file with os.remove can fail on Windows if the file’s attributes are not normal. To get around this problem, you just need to use the Win32 call to SetFileAttributes to convert it to a normal file, as shown at the end of this recipe’s Solution. Of course, this should be done with caution, since there may be a good reason the file is not “normal”. The file should be removed only if you know what you’re doing!
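
The attribute constants in win32con are bit flags, so, should you need several attributes at once, you can combine them with Python’s bitwise-or operator (a small illustrative sketch):

win32api.SetFileAttributes(thefile,
    win32con.FILE_ATTRIBUTE_HIDDEN | win32con.FILE_ATTRIBUTE_READONLY)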

2.26. Extracting Text from OpenOffice.org Documents

Credit: Dirk Holtwick

Problem

You need to extract the text content (with or without the attending XML markup) from an OpenOffice.org document.

Solution

An OpenOffice.org document is just a zip file that aggregates XML documents according to a well-documented standard. To access our precious data, we don’t even need to have OpenOffice.org installed:

import zipfile, re
rx_stripxml = re.compile("<[^>]*?>", re.DOTALL|re.MULTILINE)
def convert_OO(filename, want_text=True):
    """ Convert an OpenOffice.org document to XML or text. """
    zf = zipfile.ZipFile(filename, "r")
    data = zf.read("content.xml")
    zf.close()
    if want_text:
        data = " ".join(rx_stripxml.sub(" ", data).split())
    return data
if __name__ == "__main__":
    import sys
    if len(sys.argv)>1:
        for docname in sys.argv[1:]:
            print 'Text of', docname, ':'
            print convert_OO(docname)
            print 'XML of', docname, ':'
            print convert_OO(docname, want_text=False)
    else:
        print 'Call with paths to OO.o doc files to see Text and XML forms.'

Discussion

OpenOffice.org documents are zip files, and in addition to other contents, they always contain the file content.xml. This recipe’s job, therefore, essentially boils down to just extracting this file. By default, the recipe then throws away XML tags with a simple regular expression, splits the result by whitespace, and joins it up again with a single blank to save space. Of course, we could use an XML parser to get information in a vastly richer and more structured way, but if all we need is the rough textual content, this fast, rough-and-ready approach may suffice.

Specifically, the regular expression rx_stripxml matches any XML tag (opening or closing) from the leading < to the terminating >. Inside function convert_OO, in the statements guarded by if want_text, we use that regular expression to change every XML tag into a space, then normalize whitespace by splitting (i.e., calling the string method split, which splits on any sequence of whitespace), and rejoining (with " ".join, to use a single blank character as the joiner). Essentially, this split-and-rejoin process changes any sequence of whitespace into a single blank character. More advanced ways to extract all text from an XML document are shown in Recipe 12.3.
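
For the record, here is roughly what the parser-based alternative alluded to above might look like, using the standard library’s xml.dom.minidom (an illustrative sketch only; the recipe’s regular-expression approach remains the fast, rough-and-ready one):

import zipfile
from xml.dom import minidom
def convert_OO_via_dom(filename):
    """ Extract the text of an OpenOffice.org document with a real XML parser. """
    zf = zipfile.ZipFile(filename, "r")
    doc = minidom.parseString(zf.read("content.xml"))
    zf.close()
    # gather the data of every text node, in document order
    pieces = []
    for element in doc.getElementsByTagName('*'):
        for child in element.childNodes:
            if child.nodeType == child.TEXT_NODE:
                pieces.append(child.data)
    # normalize whitespace, just as the recipe's regex version does
    return " ".join(" ".join(pieces).split())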

See Also

Library Reference docs on modules zipfile and re; OpenOffice.org’s web site, http://www.openoffice.org/; Recipe 12.3.

2.27. Extracting Text from Microsoft Word Documents

Credit: Simon Brunning, Pavel Kosina

Problem

You want to extract the text content from each Microsoft Word document in a directory tree on Windows into a corresponding text file.

Solution

With the PyWin32 extension, we can access Word itself, through COM, to perform the conversion:

import fnmatch, os, sys, win32com.client
wordapp = win32com.client.gencache.EnsureDispatch("Word.Application")
try:
    for path, dirs, files in os.walk(sys.argv[1]):
        for filename in files:
            if not fnmatch.fnmatch(filename, '*.doc'): continue
            doc = os.path.abspath(os.path.join(path, filename))
            print "processing %s" % doc
            wordapp.Documents.Open(doc)
            docastxt = doc[:-3] + 'txt'
            wordapp.ActiveDocument.SaveAs(docastxt,
                FileFormat=win32com.client.constants.wdFormatText)
            wordapp.ActiveDocument.Close( )
finally:
    # ensure Word is properly shut down even if we get an exception
    wordapp.Quit( )

Discussion

A useful aspect of most Windows applications is that you can script them via COM, and the PyWin32 extension makes it fairly easy to perform COM scripting from Python. The extension enables you to write Python scripts to perform many kinds of Windows tasks. The script in this recipe’s Solution drives Microsoft Word to extract the text from every .doc file in a “directory” tree into a corresponding .txt text file. Using the os.walk function, we can access every subdirectory in a tree with a simple for statement, without recursion. With the fnmatch.fnmatch function, we can check a filename to determine whether it matches an appropriate wildcard, here '*.doc'. Once we have determined the name of a Word document file, we process that name with functions from os.path to turn it into a complete absolute path, and have Word open it, save it as text, and close it again.

If you don’t have Word, you may need to take a completely different approach. One possibility is to use OpenOffice.org, which is able to load Word documents. Another is to use a program specifically designed to read Word documents, such as Antiword, found at http://www.winfield.demon.nl/. However, we have not explored these alternative options.

See Also

Mark Hammond, Andy Robinson, Python Programming on Win32 (O’Reilly), for documentation on PyWin32; http://msdn.microsoft.com, for Microsoft’s documentation of the object model of Microsoft Word; Library Reference and Python in a Nutshell sections on modules fnmatch and os.path, and function os.walk.

2.28. File Locking Using a Cross-Platform API

Credit: Jonathan Feinberg, John Nielsen

Problem

You need to lock files in a program that runs on both Windows and Unix-like systems, but the Python Standard Library offers only platform-specific ways to lock files.

Solution

When the Python Standard Library itself doesn’t offer a cross-platform solution, it’s often possible to implement one ourselves:

import os
# needs win32all to work on Windows (NT, 2K, XP, _not_ /95 or /98)
if os.name == 'nt':
    import win32con, win32file, pywintypes
    LOCK_EX = win32con.LOCKFILE_EXCLUSIVE_LOCK
    LOCK_SH = 0 # the default
    LOCK_NB = win32con.LOCKFILE_FAIL_IMMEDIATELY
    __overlapped = pywintypes.OVERLAPPED()
    def lock(file, flags):
        hfile = win32file._get_osfhandle(file.fileno())
        win32file.LockFileEx(hfile, flags, 0, 0xffff0000, __overlapped)
    def unlock(file):
        hfile = win32file._get_osfhandle(file.fileno())
        win32file.UnlockFileEx(hfile, 0, 0xffff0000, __overlapped)
elif os.name == 'posix':
    import fcntl
    from fcntl import LOCK_EX, LOCK_SH, LOCK_NB
    def lock(file, flags):
        fcntl.flock(file.fileno(), flags)
    def unlock(file):
        fcntl.flock(file.fileno(), fcntl.LOCK_UN)
else:
    raise RuntimeError("PortaLocker only defined for nt and posix platforms")

Discussion

When multiple programs or threads have to access a shared file, it’s wise to ensure that accesses are synchronized so that two processes don’t try to modify the file contents at the same time. Failure to synchronize accesses could even corrupt the entire file in some cases.

This recipe supplies two functions, lock and unlock, that request and release locks on a file, respectively. Using the portalocker.py module is a simple matter of calling the lock function and passing in the file and an argument specifying the kind of lock that is desired:

Shared lock (default)

This lock denies all processes, including the process that first locks the file, write access to the file. All processes can read the locked file.

Exclusive lock

This denies all other processes both read and write access to the file.

Nonblocking lock

When this value is specified, the function returns immediately if it is unable to acquire the requested lock. Otherwise, it waits. LOCK_NB can be ORed with either LOCK_SH or LOCK_EX by using Python’s bitwise-or operator, the vertical bar (|).

For example:

import portalocker
afile = open("somefile", "r+")
portalocker.lock(afile, portalocker.LOCK_EX)
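
To ask for the lock without waiting, OR in LOCK_NB, as described above (a small sketch; which exception gets raised when the lock cannot be granted immediately differs between the posix and nt implementations):

portalocker.lock(afile, portalocker.LOCK_EX | portalocker.LOCK_NB)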

The implementation of the lock and unlock functions is entirely different on different systems. On Unix-like systems (including Linux and Mac OS X), the recipe relies on functionality made available by the standard fcntl module. On Windows systems (NT, 2000, XP—it doesn’t work on old Win/95 and Win/98 platforms because they just don’t have the needed oomph in the operating system!), the recipe uses the win32file module, part of the very popular PyWin32 package of Windows-specific extensions to Python, authored by Mark Hammond. But the important point is that, despite the differences in implementation, the functions (and the flags you can pass to the lock function) are made to behave in the same way across platforms. Such cross-platform packaging of differently implemented but equivalent functionality enables you to easily write cross-platform applications, which is one of Python’s strengths.

When you write a cross-platform program, it’s nice if the functionality that your program uses is, in turn, encapsulated in a cross-platform way. For file locking in particular, it is especially helpful to Perl users, who are used to an essentially transparent lock system call across platforms. More generally, os.name== tests just do not belong in application-level code. Such platform testing ideally should always be in the standard library or an application-independent module, as it is here.

See Also

Documentation on the fcntl module in the Library Reference; documentation on the win32file module at http://ASPN.ActiveState.com/ASPN/Python/Reference/Products/ActivePython/PythonWin32Extensions/win32file.html; Jonathan Feinberg’s web site (http://MrFeinberg.com).

2.29. Versioning Filenames

Credit: Robin Parmar, Martin Miller

Problem

You want to make a backup copy of a file, before you overwrite it, with the standard convention of appending a three-digit version number to the name of the old file.

Solution

We just need to code a function to perform the backup copy appropriately:

def VersionFile(file_spec, vtype='copy'):
    import os, shutil
    if os.path.isfile(file_spec):
        # check the 'vtype' parameter
        if vtype not in ('copy', 'rename'):
             raise ValueError, 'Unknown vtype %r' % (vtype,)
        # Determine root filename so the extension doesn't get longer
        n, e = os.path.splitext(file_spec)
        # Is e a three-digits integer preceded by a dot?
        if len(e) == 4 and e[1:].isdigit( ):
            num = 1 + int(e[1:])
            root = n
        else:
            num = 0
            root = file_spec
        # Find next available file version
        for i in xrange(num, 1000):
             new_file = '%s.%03d' % (root, i)
             if not os.path.exists(new_file):
                  if vtype == 'copy':
                      shutil.copy(file_spec, new_file)
                  else:
                      os.rename(file_spec, new_file)
                  return True
        raise RuntimeError, "Can't %s %r, all names taken"%(vtype,file_spec)
    return False
if __name__ == '__main__':
      import os
      # create a dummy file 'test.txt'
      tfn = 'test.txt'
      open(tfn, 'w').close( )
      # version it 3 times
      print VersionFile(tfn)
      # emits: True
      print VersionFile(tfn)
      # emits: True
      print VersionFile(tfn)
      # emits: True
      # remove all test.txt* files we just made
      for x in ('', '.000', '.001', '.002'):
          os.unlink(tfn + x)
      # show what happens when the file does not exist
      print VersionFile(tfn)
      # emits: False
      print VersionFile(tfn)
      # emits: False

Discussion

The purpose of the VersionFile function is to ensure that an existing file is copied (or renamed, as indicated by the optional second parameter) before you open it for writing or updating and therefore modify it. It is polite to make such backups of files before you mangle them (one functionality some people still pine for from the good old VMS operating system, which performed it automatically!). The actual copy or renaming is performed by shutil.copy and os.rename, respectively, so the only issue is which name to use as the target.

A popular way to determine backups’ names is versioning (i.e., appending to the filename a gradually incrementing number). This recipe determines the new name by first extracting the filename’s root (just in case you call it with an already-versioned filename) and then successively appending to that root the further extensions .000, .001, and so on, until a name built in this manner does not correspond to any existing file. Then, and only then, is the name used as the target of a copy or renaming. Note that VersionFile is limited to 1,000 versions, so you should have an archive plan after that. The file must exist before it is first versioned—you cannot back up what does not yet exist. However, if the file doesn’t exist, function VersionFile simply returns False (while it returns True if the file exists and has been successfully versioned), so you don’t need to check before calling it!

See Also

Documentation for the os and shutil modules in the Library Reference and Python in a Nutshell.

2.30. Calculating CRC-64 Cyclic Redundancy Checks

Credit: Gian Paolo Ciceri

Problem

You need to ensure the integrity of some data by computing the data’s cyclic redundancy check (CRC), and you need to do so according to the CRC-64 specifications of the ISO-3309 standard.

Solution

The Python Standard Library does not include any implementation of CRC-64 (only one of CRC-32 in function zlib.crc32), so we need to program it ourselves. Fortunately, Python can perform bitwise operations (masking, shifting, bitwise-and, bitwise-or, xor, etc.) just as well as, say, C (and, in fact, with just about the same syntax), so it’s easy to transliterate a typical reference implementation of CRC-64 into a Python function as follows:

# prepare two auxiliary tables (using a function, for speed),
# then remove the function, since it's not needed any more:
CRCTableh = [0] * 256
CRCTablel = [0] * 256
def _inittables(CRCTableh, CRCTablel, POLY64REVh, BIT_TOGGLE):
    for i in xrange(256):
        partl = i
        parth = 0L
        for j in xrange(8):
            rflag = partl & 1L
            partl >>= 1L
            if parth & 1:
                partl ^= BIT_TOGGLE
            parth >>= 1L
            if rflag:
                parth ^= POLY64REVh
        CRCTableh[i] = parth
        CRCTablel[i] = partl
# first 32 bits of generator polynomial for CRC64 (the 32 lower bits are
# assumed to be zero) and bit-toggle mask used in _inittables
POLY64REVh = 0xd8000000L
BIT_TOGGLE = 1L << 31L
# run the function to prepare the tables
_inittables(CRCTableh, CRCTablel, POLY64REVh, BIT_TOGGLE)
# remove all names we don't need any more, including the function
del _inittables, POLY64REVh, BIT_TOGGLE
# this module exposes the following two functions: crc64, crc64digest
def crc64(bytes, (crch, crcl)=(0,0)):
    for byte in bytes:
        shr = (crch & 0xFF) << 24
        temp1h = crch >> 8L
        temp1l = (crcl >> 8L) | shr
        tableindex = (crcl ^ ord(byte)) & 0xFF
        crch = temp1h ^ CRCTableh[tableindex]
        crcl = temp1l ^ CRCTablel[tableindex]
    return crch, crcl
def crc64digest(aString):
    return "%08X%08X" % crc64(aString)
if __name__ == '__main__':
    # a little test/demo, for when this module runs as main-script
    assert crc64("IHATEMATH") == (3822890454, 2600578513)
    assert crc64digest("IHATEMATH") == "E3DCADD69B01ADD1"
    print 'crc64: dumb test successful'

Discussion

Cyclic redundancy checks (CRCs) are a popular way to ensure that data (in particular, a file) has not been accidentally damaged. CRCs can readily detect accidental damage, but they are not intended to withstand inimical assault the way other cryptographically strong checksums are. CRCs can be computed much faster than other kinds of checksums, making them useful in those cases where the only damage we need to guard against is accidental damage, rather than deliberate adversarial tampering.

Mathematically speaking, a CRC is computed as a polynomial over the bits of the data we’re checksumming. In practice, as this recipe shows, most of the computation can be done once and for all and summarized in tables that, when properly indexed, give the contribution of each byte of input data to the result. So, after initialization (which we do with an auxiliary function because computation in Python is much faster when using a function’s local variables than when using globals), actual CRC computation is quite fast. Both the computation of the tables and their use for CRC computation require a lot of bitwise operations, but, fortunately, Python’s just as good at such operations as other languages such as C. (In fact, Python’s syntax for the various bitwise operators is just about the same as C’s.)

The algorithm to compute the standard CRC-64 checksum is described in the ISO-3309 standard, and this recipe does nothing more than implement that algorithm. The generator polynomial is x^64 + x^4 + x^3 + x + 1. (The “See Also” section within this recipe provides a reference for obtaining information about the computation.)

We represent the 64-bit result as a pair of Python ints, holding the low and high 32-bit halves of the result. To allow the CRC to be computed incrementally, in those cases where the data comes in a little at a time, we let the caller of function crc64 optionally feed in the “initial value” for the (crch, crcl) pair, presumably obtained by calling crc64 on previous parts of the data. To compute the CRC in one gulp, the caller just needs to pass in the data (a string of bytes), since in this case, we initialize the result to (0, 0) by default.
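
For instance, the incremental interface can be exercised like this (a tiny sketch that just checks it agrees with the one-gulp computation):

crc = crc64("IHATE")           # start from the default initial value (0, 0)
crc = crc64("MATH", crc)       # feed more data, passing the partial result back in
assert crc == crc64("IHATEMATH")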

See Also

W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery, Numerical Recipes in C, 2d ed. (Cambridge University Press), pp. 896ff.
