6. Complete System Programs

Chapter 6. Complete System Programs

“The Greps of Wrath”

This chapter wraps up our look at the system interfaces domain in Python by presenting a collection of larger Python scripts that do real systems work—comparing and copying directory trees, splitting files, searching files and directories, testing other programs, configuring launched programs’ shell environments, and so on. The examples here are Python system utility programs that illustrate typical tasks and techniques in this domain and focus on applying built-in tools, such as file and directory tree processing.

Although the main point of this case-study chapter is to give you a feel for realistic scripts in action, the size of these examples also gives us an opportunity to see Python’s support for development paradigms like object-oriented programming (OOP) and reuse at work. It’s really only in the context of nontrivial programs such as the ones we’ll meet here that such tools begin to bear tangible fruit. This chapter also emphasizes the “why” of system tools, not just the “how”; along the way, I’ll point out real-world needs met by the examples we’ll study, to help you put the details in context.

One note up front: this chapter moves quickly, and a few of its examples are largely listed just for independent study. Because all the scripts here are heavily documented and use Python system tools described in the preceding chapters, I won’t go through all the code in exhaustive detail. You should read the source code listings and experiment with these programs on your own computer to get a better feel for how to combine system interfaces to accomplish realistic tasks. All are available in source code form in the book’s examples distribution and most work on all major platforms.

I should also mention that most of these are programs I have really used, not examples written just for this book. They were coded over a period of years and perform widely differing tasks, so there is no obvious common thread to connect the dots here other than need. On the other hand, they help explain why system tools are useful in the first place, demonstrate larger development concepts that simpler examples cannot, and bear collective witness to the simplicity and portability of automating system tasks with Python. Once you’ve mastered the basics, you’ll wish you had done so sooner.

A Quick Game of “Find the Biggest Python File”

Quick: what’s the biggest Python source file on your computer? This was the query innocently posed by a student in one of my Python classes. Because I didn’t know either, it became an official exercise in subsequent classes, and it provides a good example of ways to apply Python system tools for a realistic purpose in this book. Really, the query is a bit vague, because its scope is unclear. Do we mean the largest Python file in a directory, in a full directory tree, in the standard library, on the module import search path, or on your entire hard drive? Different scopes imply different solutions.

Scanning the Standard Library Directory

For instance, Example 6-1 is a first-cut solution that looks for the biggest Python file in one directory—a limited scope, but enough to get started.

Example 6-1. PP4E\System\Filetools\bigpy-dir.py

"""
Find the largest Python source file in a single directory.
Search Windows Python source lib, unless dir command-line arg.
"""

import os, glob, sys
dirname = r'C:\Python31\Lib' if len(sys.argv) == 1 else sys.argv[1]

allsizes = []
allpy = glob.glob(dirname + os.sep + '*.py')
for filename in allpy:
    filesize = os.path.getsize(filename)
    allsizes.append((filesize, filename))

allsizes.sort()
print(allsizes[:2])
print(allsizes[-2:])

This script uses the glob module to run through a directory’s files and detects the largest by storing sizes and names on a list that is sorted at the end—because size appears first in the list’s tuples, it will dominate the ascending value sort, and the largest percolates to the end of the list. We could instead keep track of the currently largest as we go, but the list scheme is more flexible. When run, this script scans the Python standard library’s source directory on Windows, unless you pass a different directory on the command line, and it prints both the two smallest and largest files it finds:

C:\...\PP4E\System\Filetools> bigpy-dir.py
[(0, 'C:\\Python31\\Lib\\build_class.py'), (56, 'C:\\Python31\\Lib\\struct.py')]
[(147086, 'C:\\Python31\\Lib\\turtle.py'), (211238, 'C:\\Python31\\Lib\\decimal.py')]

C:\...\PP4E\System\Filetools> bigpy-dir.py .
[(21, '.\\__init__.py'), (461, '.\\bigpy-dir.py')]
[(1940, '.\\bigext-tree.py'), (2547, '.\\split.py')]

C:\...\PP4E\System\Filetools> bigpy-dir.py ..
[(21, '..\\__init__.py'), (29, '..\\testargv.py')]
[(541, '..\\testargv2.py'), (549, '..\\more.py')]
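The alternative mentioned above, keeping track of the largest file as we go instead of sorting a list, might look something like the following sketch. The function name largest_py and the throwaway test directory are assumptions made for illustration; they are not part of the book's examples.

```python
import glob, os, tempfile

def largest_py(dirname):
    """Return (size, path) for the largest .py file in dirname, or None."""
    best = None
    for filename in glob.glob(os.path.join(dirname, '*.py')):
        size = os.path.getsize(filename)
        if best is None or size > best[0]:       # remember the running maximum
            best = (size, filename)
    return best

# build a small temporary directory so the demo needs no particular install
with tempfile.TemporaryDirectory() as tmp:
    for name, nbytes in [('small.py', 10), ('large.py', 500), ('notes.txt', 900)]:
        with open(os.path.join(tmp, name), 'wb') as f:
            f.write(b'#' * nbytes)
    result = largest_py(tmp)
    print(result[0], os.path.basename(result[1]))
```

This uses less memory than the list scheme, but it can report only one winner; the sorted list makes it easy to show the smallest files and runners-up, too.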

Scanning the Standard Library Tree

The prior section’s solution works, but it’s obviously a partial answer—Python files are usually located in more than one directory. Even within the standard library, there are many subdirectories for module packages, and they may be arbitrarily nested. We really need to traverse an entire directory tree. Moreover, the first output above is difficult to read; Python’s pprint (for “pretty print”) module can help here. Example 6-2 puts these extensions into code.

Example 6-2. PP4E\System\Filetools\bigpy-tree.py

"""
Find the largest Python source file in an entire directory tree.
Search the Python source lib, use pprint to display results nicely.
"""

import sys, os, pprint
trace = False
if sys.platform.startswith('win'):
    dirname = r'C:\Python31\Lib'               # Windows
else:
    dirname = '/usr/lib/python'                  # Unix, Linux, Cygwin

allsizes = []
for (thisDir, subsHere, filesHere) in os.walk(dirname):
    if trace: print(thisDir)
    for filename in filesHere:
        if filename.endswith('.py'):
            if trace: print('...', filename)
            fullname = os.path.join(thisDir, filename)
            fullsize = os.path.getsize(fullname)
            allsizes.append((fullsize, fullname))

allsizes.sort()
pprint.pprint(allsizes[:2])
pprint.pprint(allsizes[-2:])

When run, this new version uses os.walk to search an entire tree of directories for the largest Python source file. Change this script’s trace variable if you want to track its progress through the tree. As coded, it searches the Python standard library’s source tree, tailored for Windows and Unix-like locations:

C:\...\PP4E\System\Filetools> bigpy-tree.py
[(0, 'C:\\Python31\\Lib\\build_class.py'),
 (0, 'C:\\Python31\\Lib\\email\\mime\\__init__.py')]
[(211238, 'C:\\Python31\\Lib\\decimal.py'),
 (380582, 'C:\\Python31\\Lib\\pydoc_data\\topics.py')]
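If only the top few files matter, the same walk can be packaged as a generator and handed to heapq.nlargest, which avoids building and sorting the full list. This is a minimal sketch under assumed names (py_sizes and the temporary tree layout are made up for the demo), not a replacement for Example 6-2.

```python
import heapq, os, tempfile

def py_sizes(root):
    """Yield (size, path) for every .py file in the tree rooted at root."""
    for (thisdir, subshere, fileshere) in os.walk(root):
        for filename in fileshere:
            if filename.endswith('.py'):
                fullname = os.path.join(thisdir, filename)
                yield (os.path.getsize(fullname), fullname)

# hypothetical nested tree: sizes chosen so the winner is two levels down
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'pkg', 'sub'))
    layout = {
        'a.py': 5,
        os.path.join('pkg', 'b.py'): 50,
        os.path.join('pkg', 'sub', 'c.py'): 500,
        os.path.join('pkg', 'd.txt'): 5000,      # not a .py file: ignored
    }
    for relpath, nbytes in layout.items():
        with open(os.path.join(root, relpath), 'wb') as f:
            f.write(b'x' * nbytes)
    top2 = heapq.nlargest(2, py_sizes(root))
    top2 = [(size, os.path.relpath(path, root)) for (size, path) in top2]

print(top2)
```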

Scanning the Module Search Path

Sure enough—the prior section’s script found smallest and largest files in subdirectories. While searching Python’s entire standard library tree this way is more inclusive, it’s still incomplete: there may be additional modules installed elsewhere on your computer, which are accessible from the module import search path but outside Python’s source tree. To be more exhaustive, we could instead essentially perform the same tree search, but for every directory on the module import search path. Example 6-3 adds this extension to include every importable Python-coded module on your computer—located both on the path directly and nested in package directory trees.

Example 6-3. PP4E\System\Filetools\bigpy-path.py

"""
Find the largest Python source file on the module import search path.
Skip already-visited directories, normalize path and case so they will
match properly, and include line counts in pprinted result. It's not
enough to use os.environ['PYTHONPATH']: this is a subset of sys.path.
"""

import sys, os, pprint
trace = 0  # 1=dirs, 2=+files

visited  = {}
allsizes = []
for srcdir in sys.path:
    for (thisDir, subsHere, filesHere) in os.walk(srcdir):
        if trace > 0: print(thisDir)
        thisDir = os.path.normpath(thisDir)
        fixcase = os.path.normcase(thisDir)
        if fixcase in visited:
            continue
        else:
            visited[fixcase] = True
        for filename in filesHere:
            if filename.endswith('.py'):
                if trace > 1: print('...', filename)
                pypath = os.path.join(thisDir, filename)
                try:
                    pysize = os.path.getsize(pypath)
                except os.error:
                    print('skipping', pypath, sys.exc_info()[0])
                else:
                    pylines = len(open(pypath, 'rb').readlines())
                    allsizes.append((pysize, pylines, pypath))

print('By size...')
allsizes.sort()
pprint.pprint(allsizes[:3])
pprint.pprint(allsizes[-3:])

print('By lines...')
allsizes.sort(key=lambda x: x[1])
pprint.pprint(allsizes[:3])
pprint.pprint(allsizes[-3:])

When run, this script marches down the module import path and, for each valid directory it contains, attempts to search the entire tree rooted there. In fact, it nests loops three deep—for items on the path, directories in the item’s tree, and files in the directory. Because the module path may contain directories named in arbitrary ways, along the way this script must take care to:

Normalize directory paths—fixing up slashes and dots to map directories to a common form.
Normalize directory name case—converting to lowercase on case-insensitive Windows, so that same names match by string equality, but leaving case unchanged on Unix, where it matters.
Detect repeats to avoid visiting the same directory twice (the same directory might be reached from more than one entry on sys.path).
Skip any file-like item in the tree for which os.path.getsize fails (by default os.walk itself silently ignores things it cannot treat as directories, both at the top of and within the tree).
Avoid potential Unicode decoding errors in file content by opening files in binary mode in order to count their lines. Text mode requires decodable content, and some files in Python 3.1’s library tree cannot be decoded properly on Windows. Catching Unicode exceptions with a try statement would avoid program exits, too, but might skip candidate files.
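The first three points in this list can be seen in isolation in the sketch below. It calls the ntpath and posixpath modules directly (the Windows and Unix flavors of os.path) so that the result is the same on any platform; the helper name dedup_key and the sample paths are assumptions for illustration only.

```python
import ntpath, posixpath

def dedup_key(path, pathmod):
    """Normalize slashes and dots, then case, as the script does per directory."""
    return pathmod.normcase(pathmod.normpath(path))

visited = set()
kept = []
# three spellings of one Windows directory (hypothetical sys.path entries)
for entry in [r'C:\Python31\Lib', r'C:\Python31\.\lib', 'c:/python31/Lib/email/..']:
    key = dedup_key(entry, ntpath)
    if key not in visited:                 # skip repeats of the same directory
        visited.add(key)
        kept.append(entry)

print(kept)                                # only the first spelling survives
print(dedup_key('/usr/lib/Python', posixpath) ==
      dedup_key('/usr/lib/python', posixpath))   # case matters on Unix
```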

This version also adds line counts; this might add significant run time to this script too, but it’s a useful metric to report. In fact, this version uses this value as a sort key to report the three largest and smallest files by line counts too—this may differ from results based upon raw file size. Here’s the script in action in Python 3.1 on my Windows 7 machine; since these results depend on platform, installed extensions, and path settings, your sys.path and largest and smallest files may vary:

C:\...\PP4E\System\Filetools> bigpy-path.py
By size...
[(0, 0, 'C:\\Python31\\lib\\build_class.py'),
 (0, 0, 'C:\\Python31\\lib\\email\\mime\\__init__.py'),
 (0, 0, 'C:\\Python31\\lib\\email\\test\\__init__.py')]
[(161613, 3754, 'C:\\Python31\\lib\\tkinter\\__init__.py'),
 (211238, 5768, 'C:\\Python31\\lib\\decimal.py'),
 (380582, 78, 'C:\\Python31\\lib\\pydoc_data\\topics.py')]
By lines...
[(0, 0, 'C:\\Python31\\lib\\build_class.py'),
 (0, 0, 'C:\\Python31\\lib\\email\\mime\\__init__.py'),
 (0, 0, 'C:\\Python31\\lib\\email\\test\\__init__.py')]
[(147086, 4132, 'C:\\Python31\\lib\\turtle.py'),
 (150069, 4268, 'C:\\Python31\\lib\\test\\test_descr.py'),
 (211238, 5768, 'C:\\Python31\\lib\\decimal.py')]

Again, change this script’s trace variable if you want to track its progress through the tree. As you can see, the results for largest files differ when viewed by size and lines—a disparity which we’ll probably have to hash out in our next requirements meeting.
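To see why the two rankings disagree, here is a small sketch that sorts three of the (size, lines, path) tuples from the output above both ways. The paths are shortened for readability, and the variable names are mine rather than the script's.

```python
# (bytes, lines, path) tuples taken from the "By size" results above
sample = [
    (161613, 3754, 'tkinter/__init__.py'),
    (211238, 5768, 'decimal.py'),
    (380582,   78, 'pydoc_data/topics.py'),
]

by_size  = sorted(sample)                            # first tuple item dominates
by_lines = sorted(sample, key=lambda item: item[1])  # sort on line count instead

print(by_size[-1][2])       # largest by bytes
print(by_lines[-1][2])      # largest by lines
```

The topics.py file wins on bytes with only 78 very long lines, while decimal.py wins on lines.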

Scanning the Entire Machine

Finally, although searching trees rooted in the module import path normally includes every Python source file you can import on your computer, it’s still not complete. Technically, this approach checks only modules; Python source files which are top-level scripts run directly do not need to be included in the module path. Moreover, the module search path may be manually changed by some scripts dynamically at runtime (for example, by direct sys.path updates in scripts that run on web servers) to include additional directories that Example 6-3 won’t catch.

Ultimately, finding the largest source file on your computer requires searching your entire drive—a feat which our tree searcher in Example 6-2 almost supports, if we generalize it to accept the root directory name as an argument and add some of the bells and whistles of the path searcher version (we really want to avoid visiting the same directory twice if we’re scanning an entire machine, and we might as well skip errors and check line-based sizes if we’re investing the time). Example 6-4 implements such general tree scans, outfitted for the heavier lifting required for scanning drives.

Example 6-4. PP4E\System\Filetools\bigext-tree.py

"""
Find the largest file of a given type in an arbitrary directory tree.
Avoid repeat paths, catch errors, add tracing and line count size.
Also uses sets, file iterators and generator to avoid loading entire
file, and attempts to work around undecodable dir/file name prints.
"""

import os, pprint
from sys import argv, exc_info

trace = 1                                    # 0=off, 1=dirs, 2=+files
dirname, extname = os.curdir, '.py'          # default is .py files in cwd
if len(argv) > 1: dirname = argv[1]          # ex: C:\, C:\Python31\Lib
if len(argv) > 2: extname = argv[2]          # ex: .pyw, .txt
if len(argv) > 3: trace   = int(argv[3])     # ex: ". .py 2"

def tryprint(arg):
    try:
        print(arg)                           # unprintable filename?
    except UnicodeEncodeError:
        print(arg.encode())                  # try raw byte string

visited  = set()
allsizes = []
for (thisDir, subsHere, filesHere) in os.walk(dirname):
    if trace: tryprint(thisDir)
    thisDir = os.path.normpath(thisDir)
    fixname = os.path.normcase(thisDir)
    if fixname in visited:
        if trace: tryprint('skipping ' + thisDir)
    else:
        visited.add(fixname)
        for filename in filesHere:
            if filename.endswith(extname):
                if trace > 1: tryprint('+++' + filename)
                fullname = os.path.join(thisDir, filename)
                try:
                    bytesize = os.path.getsize(fullname)
                    linesize = sum(+1 for line in open(fullname, 'rb'))
                except Exception:
                    print('error', exc_info()[0])
                else:
                    allsizes.append((bytesize, linesize, fullname))

for (title, key) in [('bytes', 0), ('lines', 1)]:
    print('\nBy %s...' % title)
    allsizes.sort(key=lambda x: x[key])
    pprint.pprint(allsizes[:3])
    pprint.pprint(allsizes[-3:])

Unlike the prior tree version, this one allows us to search in specific directories, and for specific extensions. The default is to simply search the current working directory for Python files:

C:\...\PP4E\System\Filetools> bigext-tree.py
.

By bytes...
[(21, 1, '.\\__init__.py'),
 (461, 17, '.\\bigpy-dir.py'),
 (818, 25, '.\\bigpy-tree.py')]
[(1696, 48, '.\\join.py'),
 (1940, 49, '.\\bigext-tree.py'),
 (2547, 57, '.\\split.py')]

By lines...
[(21, 1, '.\\__init__.py'),
 (461, 17, '.\\bigpy-dir.py'),
 (818, 25, '.\\bigpy-tree.py')]
[(1696, 48, '.\\join.py'),
 (1940, 49, '.\\bigext-tree.py'),
 (2547, 57, '.\\split.py')]
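The line counts in these results come from iterating over each file in binary mode, which reads one line at a time and never decodes content. The sketch below isolates that technique; count_lines is a hypothetical helper, and unlike the script it uses a with statement to close the file explicitly rather than relying on garbage collection.

```python
import os, tempfile

def count_lines(path):
    """Count lines without loading the whole file or decoding its content."""
    with open(path, 'rb') as f:
        return sum(1 for line in f)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'odd.txt')
    with open(path, 'wb') as f:
        f.write(b'spam\n\xff\xfe not valid UTF-8\neggs')   # undecodable bytes
    nlines = count_lines(path)
    try:
        with open(path, encoding='utf-8') as f:   # text mode must decode
            f.read()
        decodable = True
    except UnicodeDecodeError:
        decodable = False

print(nlines, decodable)
```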

For more custom work, we can pass in a directory name, extension type, and trace level on the command-line now (trace level 0 disables tracing, and 1, the default, shows directories visited along the way):

C:\...\PP4E\System\Filetools> bigext-tree.py .. .py 0

By bytes...
[(21, 1, '..\\__init__.py'),
 (21, 1, '..\\Filetools\\__init__.py'),
 (28, 1, '..\\Streams\\hello-out.py')]
[(2278, 67, '..\\Processes\\multi2.py'),
 (2547, 57, '..\\Filetools\\split.py'),
 (4361, 105, '..\\Tester\\tester.py')]

By lines...
[(21, 1, '..\\__init__.py'),
 (21, 1, '..\\Filetools\\__init__.py'),
 (28, 1, '..\\Streams\\hello-out.py')]
[(2547, 57, '..\\Filetools\\split.py'),
 (2278, 67, '..\\Processes\\multi2.py'),
 (4361, 105, '..\\Tester\\tester.py')]

This script also lets us scan for different file types; here it is picking out the smallest and largest text file from one level up (at the time I ran this script, at least):

C:\...\PP4E\System\Filetools> bigext-tree.py .. .txt 1
..
..\Environment
..\Filetools
..\Processes
..\Streams
..\Tester
..\Tester\Args
..\Tester\Errors
..\Tester\Inputs
..\Tester\Outputs
..\Tester\Scripts
..\Tester\xxold
..\Threads

By bytes...
[(4, 2, '..\\Streams\\input.txt'),
 (13, 1, '..\\Streams\\hello-in.txt'),
 (20, 4, '..\\Streams\\data.txt')]
[(104, 4, '..\\Streams\\output.txt'),
 (172, 3, '..\\Tester\\xxold\\README.txt.txt'),
 (435, 4, '..\\Filetools\\temp.txt')]

By lines...
[(13, 1, '..\\Streams\\hello-in.txt'),
 (22, 1, '..\\spam.txt'),
 (4, 2, '..\\Streams\\input.txt')]
[(20, 4, '..\\Streams\\data.txt'),
 (104, 4, '..\\Streams\\output.txt'),
 (435, 4, '..\\Filetools\\temp.txt')]

And now, to search your entire system, simply pass in your machine’s root directory name (use / instead of C:\ on Unix-like machines), along with an optional file extension type (.py is just the default now). The winner is…(please, no wagering):

C:\...\PP4E\dev\Examples\PP4E\System\Filetools> bigext-tree.py C:\
C:\
C:\$Recycle.Bin
C:\$Recycle.Bin\S-1-5-21-3951091421-2436271001-910485044-1004
C:\cygwin
C:\cygwin\bin
C:\cygwin\cygdrive
C:\cygwin\dev
C:\cygwin\dev\mqueue
C:\cygwin\dev\shm
C:\cygwin\etc
...MANY more lines omitted...

By bytes...
[(0, 0, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\build_class.py'),
 (0, 0, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\email\\mime\\__init__.py'),
 (0, 0, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\email\\test\\__init__.py')]
[(380582, 78, 'C:\\Python31\\Lib\\pydoc_data\\topics.py'),
 (398157, 83, 'C:\\...\\Install\\Source\\Python-2.6\\Lib\\pydoc_topics.py'),
 (412434, 83, 'C:\\Python26\\Lib\\pydoc_topics.py')]

By lines...
[(0, 0, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\build_class.py'),
 (0, 0, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\email\\mime\\__init__.py'),
 (0, 0, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\email\\test\\__init__.py')]
[(204107, 5589, 'C:\\...Install\\Source\\Python-3.0\\Lib\\decimal.py'),
 (205470, 5768, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\decimal.py'),
 (211238, 5768, 'C:\\Python31\\Lib\\decimal.py')]

The script’s trace logic is preset to allow you to monitor its directory progress. I’ve shortened some directory names to protect the innocent here (and to fit on this page). This command may take a long time to finish on your computer—on my sadly underpowered Windows 7 netbook, it took 11 minutes to scan a solid state drive with some 59G of data, 200K files, and 25K directories when the system was lightly loaded (8 minutes when not tracing directory names, but half an hour when many other applications were running). Nevertheless, it provides the most exhaustive solution to the original query of all our attempts.

This is also as complete a solution as we have space for in this book. For more fun, consider that you may need to scan more than one drive, and some Python source files may also appear in zip archives, whether or not they are on the module path (the os.walk search in Example 6-3 does not look inside zip files). Source files might also be named in other ways: with .pyw extensions to suppress shell pop ups on Windows, and with arbitrary extensions for some top-level scripts. In fact, top-level scripts might have no filename extension at all, even though they are Python source files. And while they’re generally not Python files, some importable modules may also appear in frozen binaries or be statically linked into the Python executable. In the interest of space, we’ll leave such higher resolution (and potentially intractable!) search extensions as suggested exercises.
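As a head start on one of these exercises, note that str.endswith accepts a tuple of suffixes, so matching both .py and .pyw files takes only a small change to the filename test. The names list here is made up for the demo; in Example 6-4 the test would apply to the names os.walk returns.

```python
names = ['spam.py', 'gui.pyw', 'notes.txt', 'script', 'data.pyc']

extensions = ('.py', '.pyw')                 # endswith accepts a tuple of suffixes
matches = [name for name in names if name.endswith(extensions)]

print(matches)
```

Scripts with no extension at all would still slip through, of course; catching those requires looking at file content, not names.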

Printing Unicode Filenames

One fine point before we move on: notice the seemingly superfluous exception handling in Example 6-4’s tryprint function. When I first tried to scan an entire drive as shown in the preceding section, this script died on a Unicode encoding error while trying to print a directory name of a saved web page. Adding the exception handler skips the error entirely.

This demonstrates a subtle but pragmatically important issue: Python 3.X’s Unicode orientation extends to filenames, even if they are just printed. As we learned in Chapter 4, because filenames may contain arbitrary text, os.listdir returns filenames in two different ways—we get back decoded Unicode strings when we pass in a normal str argument, and still-encoded byte strings when we send a bytes:

>>> import os
>>> os.listdir('.')[:4]
['bigext-tree.py', 'bigpy-dir.py', 'bigpy-path.py', 'bigpy-tree.py']

>>> os.listdir(b'.')[:4]
[b'bigext-tree.py', b'bigpy-dir.py', b'bigpy-path.py', b'bigpy-tree.py']

Both os.walk (used in the Example 6-4 script) and glob.glob inherit this behavior for the directory and file names they return, because they work by calling os.listdir internally at each directory level. For all these calls, passing in a byte string argument suppresses Unicode decoding of file and directory names. Passing a normal string assumes that filenames are decodable per the file system’s Unicode scheme.
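The sketch below shows this inheritance for os.walk on a throwaway directory tree. It uses os.fsencode to build the byte string root, a call added in Python releases after the 3.1 used in this chapter, so treat it as an assumption about your Python version; the file and directory names are invented for the demo.

```python
import os, tempfile

with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'sub'))
    open(os.path.join(root, 'sub', 'spam.py'), 'w').close()

    # str root: names come back as decoded str objects
    str_names = [f for (d, subs, files) in os.walk(root) for f in files]

    # bytes root: names come back as still-encoded bytes objects
    bytes_names = [f for (d, subs, files) in os.walk(os.fsencode(root)) for f in files]

print(str_names, bytes_names)
```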

The reason this potentially mattered to this section’s example is that running the tree search version over an entire hard drive eventually reached an undecodable filename (an old saved web page with an odd name), which generated an exception when the print function tried to display it. Here’s a simplified recreation of the error, run in a shell window (Command Prompt) on Windows:

>>> root = r'C:\py3000'
>>> for (dir, subs, files) in os.walk(root): print(dir)
...
C:\py3000
C:\py3000\FutureProofPython - PythonInfo Wiki_files
C:\py3000\Oakwinter_com  Code » Porting setuptools to py3k_files
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python31\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 45: character maps to <undefined>

One way out of this dilemma is to use byte strings for the directory root name; this suppresses filename decoding in the os.listdir calls run by os.walk, and effectively limits the scope of later printing to raw bytes. Since printing does not have to deal with encodings, it works without error. Manually encoding to bytes prior to printing works too, but the results are slightly different:

>>> root.encode()
b'C:\\py3000'

>>> for (dir, subs, files) in os.walk(root.encode()): print(dir)
...
b'C:\\py3000'
b'C:\\py3000\\FutureProofPython - PythonInfo Wiki_files'
b'C:\\py3000\\Oakwinter_com  Code \xbb Porting setuptools to py3k_files'
b'C:\\py3000\\What\x92s New in Python 3_0 \x97 Python Documentation'

>>> for (dir, subs, files) in os.walk(root): print(dir.encode())
...
b'C:\\py3000'
b'C:\\py3000\\FutureProofPython - PythonInfo Wiki_files'
b'C:\\py3000\\Oakwinter_com  Code \xc2\xbb Porting setuptools to py3k_files'
b'C:\\py3000\\What\xe2\x80\x99s New in Python 3_0 \xe2\x80\x94 Python Documentation'

Unfortunately, either approach means that all the directory names printed during the walk display as cryptic byte strings. To maintain the better readability of normal strings, I instead opted for the exception handler approach used in the script’s code. This avoids the issues entirely:

>>> for (dir, subs, files) in os.walk(root):
...     try:
...         print(dir)
...     except UnicodeEncodeError:
...         print(dir.encode())           # or simply punt if encode may fail too
...
C:\py3000
C:\py3000\FutureProofPython - PythonInfo Wiki_files
C:\py3000\Oakwinter_com  Code » Porting setuptools to py3k_files
b'C:\\py3000\\What\xe2\x80\x99s New in Python 3_0 \xe2\x80\x94 Python Documentation'

Oddly, though, the error seems more related to printing than to Unicode encodings of filenames—because the filename did not fail until printed, it must have been decodable when its string was created initially. That’s why wrapping up the print in a try suffices; otherwise, the error would occur earlier.
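Because the failure happens when text is encoded for the output stream, another possible workaround is to encode with the backslashreplace error handler, which keeps most of the name readable and escapes only the characters the stream cannot show. This is a sketch of an alternative, not what Example 6-4 does; the helper name printable is an assumption, and cp437 is hardcoded here to mimic the Command Prompt encoding shown in the traceback (a real script might use sys.stdout.encoding instead).

```python
def printable(text, encoding):
    """Escape only the characters that encoding cannot represent."""
    return text.encode(encoding, errors='backslashreplace').decode(encoding)

name = 'What\u2019s New in Python 3_0'       # contains a curly apostrophe

try:
    name.encode('cp437')                     # what printing must do in the shell
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

safe = printable(name, 'cp437')
print(strict_ok, safe)
```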

Moreover, this error does not occur if the script’s output is redirected to a file, either at the shell level (bigext-tree.py c:\ > out), or by the print call itself (print(dir, file=F)). In the latter case the output file must later be read back in binary mode, as text mode triggers the same error when printing the file’s content to the shell window (but again, not until printed). In fact, the exact same code that fails when run in a system shell Command Prompt on Windows works without error when run in the IDLE GUI on the same platform: the tkinter GUI used by IDLE handles display of characters that printing to standard output connected to a shell terminal window does not:

>>> import os                  # run in IDLE (a tkinter GUI), not system shell
>>> root = r'C:\py3000'
>>> for (dir, subs, files) in os.walk(root): print(dir)

C:\py3000
C:\py3000\FutureProofPython - PythonInfo Wiki_files
C:\py3000\Oakwinter_com  Code » Porting setuptools to py3k_files
C:\py3000\What's New in Python 3_0 — Python Documentation_files

In other words, the exception occurs only when printing to a shell window, and long after the file name string is created. This reflects an artifact of extra translations performed by the Python printer, not of Unicode file names in general. Because we have no room for further exploration here, though, we’ll have to be satisfied with the fact that our exception handler sidesteps the printing problem altogether. You should still be aware of the implications of Unicode filename decoding, though; on some platforms you may need to pass byte strings to os.walk in this script to prevent decoding errors as filenames are created.^[18]

Since Unicode is still relatively new in 3.1, be sure to test for such errors on your computer and your Python. See also Python’s manuals for more on the treatment of Unicode filenames, and the text Learning Python for more on Unicode in general. As noted earlier, our scripts also had to open text files in binary mode because some might contain undecodable content too. It might seem surprising that Unicode issues can crop up in basic printing like this too, but such is life in the brave new Unicode world. Many real-world scripts don’t need to care much about Unicode, of course, including those we’ll explore in the next section.

Splitting and Joining Files

Like most kids, mine spent a lot of time on the Internet when they were growing up. As far as I could tell, it was the thing to do. Among their generation, computer geeks and gurus seem to have been held in the same sort of esteem that my generation once held rock stars. When kids disappeared into their rooms, chances were good that they were hacking on computers, not mastering guitar riffs (well, real ones, at least). It may or may not be healthier than some of the diversions of my own misspent youth, but that’s a topic for another kind of book.

Despite the rhetoric of techno-pundits about the Web’s potential to empower an upcoming generation in ways unimaginable by their predecessors, my kids seemed to spend most of their time playing games. To fetch new ones in my house at the time, they had to download to a shared computer which had Internet access and transfer those games to their own computers to install. (Their own machines did not have Internet access until later, for reasons that most parents in the crowd could probably expand upon.)

The problem with this scheme is that game files are not small. They were usually much too big to fit on a floppy or memory stick of the time, and burning a CD or DVD took away valuable game-playing time. If all the machines in my house ran Linux, this would have been a nonissue. There are standard command-line programs on Unix for chopping a file into pieces small enough to fit on a transfer device (split), and others for putting the pieces back together to re-create the original file (cat). Because we had all sorts of different machines in the house, though, we needed a more portable solution.^[19]

Splitting Files Portably

Since all the computers in my house ran Python, a simple portable Python script came to the rescue. The Python program in Example 6-5 distributes a single file’s contents among a set of part files and stores those part files in a directory.

Example 6-5. PP4E\System\Filetools\split.py

#!/usr/bin/python
"""
################################################################################
split a file into a set of parts; join.py puts them back together;
this is a customizable version of the standard Unix split command-line
utility; because it is written in Python, it also works on Windows and
can be easily modified; because it exports a function, its logic can
also be imported and reused in other applications;
################################################################################
"""

import sys, os
kilobytes = 1024
megabytes = kilobytes * 1000
chunksize = int(1.4 * megabytes)                   # default: roughly a floppy

def split(fromfile, todir, chunksize=chunksize):
    if not os.path.exists(todir):                  # caller handles errors
        os.mkdir(todir)                            # make dir, read/write parts
    else:
        for fname in os.listdir(todir):            # delete any existing files
            os.remove(os.path.join(todir, fname))
    partnum = 0
    input = open(fromfile, 'rb')                   # binary: no decode, endline
    while True:                                    # eof=empty string from read
        chunk = input.read(chunksize)              # get next part <= chunksize
        if not chunk: break
        partnum += 1
        filename = os.path.join(todir, ('part%04d' % partnum))
        fileobj  = open(filename, 'wb')
        fileobj.write(chunk)
        fileobj.close()                            # or simply open().write()
    input.close()
    assert partnum <= 9999                         # join sort fails if 5 digits
    return partnum

if __name__ == '__main__':
    if len(sys.argv) == 2 and sys.argv[1] == '-help':
        print('Use: split.py [file-to-split target-dir [chunksize]]')
    else:
        if len(sys.argv) < 3:
            interactive = True
            fromfile = input('File to be split? ')           # input if clicked
            todir    = input('Directory to store part files? ')
        else:
            interactive = False
            fromfile, todir = sys.argv[1:3]                  # args in cmdline
            if len(sys.argv) == 4: chunksize = int(sys.argv[3])
        absfrom, absto = map(os.path.abspath, [fromfile, todir])
        print('Splitting', absfrom, 'to', absto, 'by', chunksize)

        try:
            parts = split(fromfile, todir, chunksize)
        except:
            print('Error during split:')
            print(sys.exc_info()[0], sys.exc_info()[1])
        else:
            print('Split finished:', parts, 'parts are in', absto)
        if interactive: input('Press Enter key') # pause if clicked

By default, this script splits the input file into chunks that are roughly the size of a floppy disk—perfect for moving big files between the electronically isolated machines of the time. Most importantly, because this is all portable Python code, this script will run on just about any machine, even ones without their own file splitter. All it requires is an installed Python. Here it is at work splitting a Python 3.1 self-installer executable located in the current working directory on Windows (I’ve omitted a few dir output lines to save space here; use ls -l on Unix):

C:\temp> cd C:\temp

C:\temp> dir python-3.1.msi
...more...
06/27/2009  04:53 PM        13,814,272 python-3.1.msi
               1 File(s)     13,814,272 bytes
               0 Dir(s)  188,826,189,824 bytes free

C:\temp> python C:\...\PP4E\System\Filetools\split.py -help
Use: split.py [file-to-split target-dir [chunksize]]

C:\temp> python C:\...\PP4E\System\Filetools\split.py python-3.1.msi pysplit
Splitting C:\temp\python-3.1.msi to C:\temp\pysplit by 1433600
Split finished: 10 parts are in C:\temp\pysplit

C:\temp> dir pysplit
...more...
02/21/2010  11:13 AM    <DIR>          .
02/21/2010  11:13 AM    <DIR>          ..
02/21/2010  11:13 AM         1,433,600 part0001
02/21/2010  11:13 AM         1,433,600 part0002
02/21/2010  11:13 AM         1,433,600 part0003
02/21/2010  11:13 AM         1,433,600 part0004
02/21/2010  11:13 AM         1,433,600 part0005
02/21/2010  11:13 AM         1,433,600 part0006
02/21/2010  11:13 AM         1,433,600 part0007
02/21/2010  11:13 AM         1,433,600 part0008
02/21/2010  11:13 AM         1,433,600 part0009
02/21/2010  11:13 AM           911,872 part0010
              10 File(s)     13,814,272 bytes
               2 Dir(s)  188,812,328,960 bytes free

Each of these generated part files represents one binary chunk of the file python-3.1.msi, a chunk small enough to fit comfortably on a floppy disk of the time. In fact, if you add the sizes of the generated part files given by the dir command, you’ll come up with exactly the same number of bytes as the original file’s size. Before we see how to put these files back together again, here are a few points to ponder as you study this script’s code:

Operation modes

This script is designed to input its parameters in either interactive or command-line mode; it checks the number of command-line arguments to find out the mode in which it is being used. In command-line mode, you list the file to be split and the output directory on the command line, and you can optionally override the default part file size with a third command-line argument.

In interactive mode, the script asks for a filename and output directory at the console window with input and pauses for a key press at the end before exiting. This mode is nice when the program file is started by clicking on its icon; on Windows, parameters are typed into a pop-up DOS box that doesn’t automatically disappear. The script also shows the absolute paths of its parameters (by running them through os.path.abspath) because they may not be obvious in interactive mode.

Binary file mode

This code is careful to open both input and output files in binary mode (rb, wb), because it needs to portably handle things like executables and audio files, not just text. In Chapter 4, we learned that on Windows, text-mode files automatically map \r\n end-of-line sequences to \n on input and map \n to \r\n on output. For true binary data, we really don’t want any \r characters in the data to go away when read, and we don’t want any superfluous \r characters to be added on output. Binary-mode files suppress this \r mapping when the script is run on Windows and so avoid data corruption.

In Python 3.X, binary mode also means that file data is bytes objects in our script, not encoded str text, though we don’t need to do anything special—this script’s file processing code runs the same on Python 3.X as it did on 2.X. In fact, binary mode is required in 3.X for this program, because the target file’s data may not be encoded text at all; text mode requires that file content must be decodable in 3.X, and that might fail both for truly binary data and text files obtained from other platforms. On output, binary mode accepts bytes and suppresses Unicode encoding and line-end translations.
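To make the difference concrete, here is a minimal sketch, assuming a scratch file in a temporary directory and a made-up byte string that is not valid UTF-8. It is not part of split.py; it only illustrates that binary mode hands back exactly the bytes written, while text mode must decode them and can fail.

```python
import os, tempfile

# Hypothetical sample data: bytes that are not valid UTF-8 text, plus a
# Windows-style line end that text mode could translate on some platforms.
data = b'\xff\xfe\r\nraw\x00bytes'

path = os.path.join(tempfile.mkdtemp(), 'sample.bin')
with open(path, 'wb') as f:                 # binary mode: bytes written as-is
    f.write(data)

with open(path, 'rb') as f:                 # binary mode: bytes read back as-is
    roundtrip = f.read()

try:
    with open(path, 'r', encoding='utf-8') as f:   # text mode must decode
        f.read()
    decoded_ok = True
except UnicodeDecodeError:                  # 0xff cannot start a UTF-8 sequence
    decoded_ok = False

print(roundtrip == data, decoded_ok)
```

The explicit encoding='utf-8' argument is there only to make the failure the same on every platform; with the platform default encoding, the outcome of the text-mode read may differ.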

Manually closing files

This script also goes out of its way to manually close its files. As we also saw in Chapter 4, we can often get by with a single line: open(partname, 'wb').write(chunk). This shorter form relies on the fact that the current Python implementation automatically closes files for you when file objects are reclaimed (i.e., when they are garbage collected, because there are no more references to the file object). In this one-liner, the file object would be reclaimed immediately, because the open result is temporary in an expression and is never referenced by a longer-lived name. Similarly, the input file is reclaimed when the split function exits.

However, it’s not impossible that this automatic-close behavior may go away in the future. Moreover, the Jython Java-based Python implementation does not reclaim unreferenced objects as immediately as the standard Python. You should close manually if you care about the Java port, your script may potentially create many files in a short amount of time, and it may run on a machine that has a limit on the number of open files per program. Because the split function in this module is intended to be a general-purpose tool, it accommodates such worst-case scenarios. Also see Chapter 4’s mention of the file context manager and the with statement; this provides an alternative way to guarantee file closes.
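As a rough illustration of the context manager alternative, the following sketch writes one part file inside a with statement. The helper name write_part and the demo values are assumptions made for this example, not code from split.py.

```python
import os, tempfile

def write_part(todir, partnum, chunk):
    """Write one chunk to a zero-padded part file and return the file's path."""
    filename = os.path.join(todir, 'part%04d' % partnum)
    with open(filename, 'wb') as fileobj:   # closed on block exit, even on errors
        fileobj.write(chunk)
    return filename

todir = tempfile.mkdtemp()                  # scratch directory for the demo
partpath = write_part(todir, 1, b'spam' * 4)
print(os.path.basename(partpath), os.path.getsize(partpath))
```

The with statement closes the file when its block exits, whether normally or by exception, so the result does not depend on when a given Python implementation reclaims the file object.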

Joining Files Portably

Back to moving big files around the house: after downloading a big game program file, you can run the previous splitter script by clicking on its name in Windows Explorer and typing filenames. After a split, simply copy each part file onto its own floppy (or other more modern medium), walk the files to the destination machine, and re-create the split output directory on the target computer by copying the part files. Finally, the script in Example 6-6 is clicked or otherwise run to put the parts back together.

Example 6-6. PP4E\System\Filetools\join.py

#!/usr/bin/python
"""
################################################################################
join all part files in a dir created by split.py, to re-create file.
This is roughly like a 'cat fromdir/* > tofile' command on unix, but is
more portable and configurable, and exports the join operation as a
reusable function.  Relies on sort order of filenames: must be same
length.  Could extend split/join to pop up Tkinter file selectors.
################################################################################
"""

import os, sys
readsize = 1024

def join(fromdir, tofile):
    output = open(tofile, 'wb')
    parts  = os.listdir(fromdir)
    parts.sort()
    for filename in parts:
        filepath = os.path.join(fromdir, filename)
        fileobj  = open(filepath, 'rb')
        while True:
            filebytes = fileobj.read(readsize)
            if not filebytes: break
            output.write(filebytes)
        fileobj.close()
    output.close()

if __name__ == '__main__':
    if len(sys.argv) == 2 and sys.argv[1] == '-help':
        print('Use: join.py [from-dir-name to-file-name]')
    else:
        if len(sys.argv) != 3:
            interactive = True
            fromdir = input('Directory containing part files? ')
            tofile  = input('Name of file to be recreated? ')
        else:
            interactive = False
            fromdir, tofile = sys.argv[1:]
        absfrom, absto = map(os.path.abspath, [fromdir, tofile])
        print('Joining', absfrom, 'to make', absto)

        try:
            join(fromdir, tofile)
        except:
            print('Error joining files:')
            print(sys.exc_info()[0], sys.exc_info()[1])
        else:
            print('Join complete: see', absto)
        if interactive: input('Press Enter key') # pause if clicked

Here is a join in progress on Windows, combining the split files we made a moment ago; after running the join script, you still may need to run something like zip, gzip, or tar to unpack an archive file unless it’s shipped as an executable, but at least the original downloaded file is set to go^[20]:

C:\temp> python C:\...\PP4E\System\Filetools\join.py -help
Use: join.py [from-dir-name to-file-name]

C:\temp> python C:\...\PP4E\System\Filetools\join.py pysplit mypy31.msi
Joining C:\temp\pysplit to make C:\temp\mypy31.msi
Join complete: see C:\temp\mypy31.msi

C:\temp> dir *.msi
...more...
02/21/2010  11:21 AM        13,814,272 mypy31.msi
06/27/2009  04:53 PM        13,814,272 python-3.1.msi
               2 File(s)     27,628,544 bytes
               0 Dir(s)  188,798,611,456 bytes free

C:\temp> fc /b mypy31.msi python-3.1.msi
Comparing files mypy31.msi and PYTHON-3.1.MSI
FC: no differences encountered

The join script simply uses os.listdir to collect all the part files in a directory created by split, and sorts the filename list to put the parts back together in the correct order. We get back an exact byte-for-byte copy of the original file (proved by the DOS fc command in the session above; use cmp on Unix).

Some of this process is still manual, of course (I never did figure out how to script the “walk the floppies to your bedroom” step), but the split and join scripts make it both quick and simple to move big files around. Because this script is also portable Python code, it runs on any platform to which we cared to move split files. For instance, my home computers ran both Windows and Linux at the time; since this script runs on either platform, the gamers were covered. Before we move on, here are a couple of implementation details worth underscoring in the join script’s code:

Reading by blocks or files

First of all, notice that this script deals with files in binary mode but also reads each part file in blocks of 1 KB each. In fact, the readsize setting here (the size of each block read from an input part file) has no relation to chunksize in split.py (the total size of each output part file). As we learned in Chapter 4, this script could instead read each part file all at once: output.write(open(filepath, 'rb').read()). The downside to this scheme is that it really does load all of a file into memory at once. For example, reading a 1.4 MB part file into memory all at once with the file object read method generates a 1.4 MB string in memory to hold the file’s bytes. Since split allows users to specify even larger chunk sizes, the join script plans for the worst and reads in terms of limited-size blocks. To be completely robust, the split script could read its input data in smaller chunks too, but this hasn’t become a concern in practice (recall that as your program runs, Python automatically reclaims strings that are no longer referenced, so this isn’t as wasteful as it might seem).
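For comparison, here is a minimal sketch of the same block-wise idea using the standard library's shutil.copyfileobj, which copies between open file objects in pieces of a given size. The function name join_with_copyfileobj and the demo part files are hypothetical; this is one possible variation, not the book's join.py.

```python
import os, shutil, tempfile

readsize = 1024                              # same block size as join.py

def join_with_copyfileobj(fromdir, tofile):
    """Hypothetical join variant: concatenate sorted part files into tofile."""
    with open(tofile, 'wb') as output:
        for filename in sorted(os.listdir(fromdir)):
            filepath = os.path.join(fromdir, filename)
            with open(filepath, 'rb') as fileobj:
                # copies readsize bytes at a time, so memory use stays bounded
                shutil.copyfileobj(fileobj, output, readsize)

# demo: three made-up part files in a scratch directory
workdir  = tempfile.mkdtemp()
partsdir = os.path.join(workdir, 'parts')
os.mkdir(partsdir)
chunks = [b'a' * 3000, b'b' * 3000, b'c' * 500]
for num, chunk in enumerate(chunks, start=1):
    with open(os.path.join(partsdir, 'part%04d' % num), 'wb') as f:
        f.write(chunk)

target = os.path.join(workdir, 'joined.bin')
join_with_copyfileobj(partsdir, target)
with open(target, 'rb') as f:
    joined = f.read()
print(len(joined))
```

Either way, the amount of data held in memory at once is set by readsize, not by the size of the part files.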

Sorting filenames

If you study this script’s code closely, you may also notice that the join scheme it uses relies completely on the sort order of filenames in the parts directory. Because it simply calls the list sort method on the filenames list returned by os.listdir, it implicitly requires that filenames have the same length and format when created by split. To satisfy this requirement, the splitter uses zero-padding notation in a string formatting expression ('part%04d') to make sure that filenames all have the same number of digits at the end (four). When sorted, the leading zero characters in small numbers guarantee that part files are ordered for joining correctly.

Alternatively, we could strip off digits in filenames, convert them with int, and sort numerically, by using the list sort method’s key argument, but that would still imply that all filenames must start with the same type of substring, and so doesn’t quite remove the file-naming dependency between the split and join scripts. Because these scripts are designed to be two steps of the same process, though, some dependencies between them seem reasonable.
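The ordering issue is easy to see with a few made-up names. This sketch compares unpadded and zero-padded part names under a plain string sort, and then shows the numeric key alternative; the partnumber helper is hypothetical and assumes every name starts with the prefix 'part'.

```python
numbers  = (1, 2, 10, 11)
unpadded = ['part%d'   % n for n in numbers]    # part1, part2, part10, part11
padded   = ['part%04d' % n for n in numbers]    # part0001, part0002, ...

# Plain string sort compares character by character, so 'part10' < 'part2'
print(sorted(unpadded))     # ['part1', 'part10', 'part11', 'part2']
print(sorted(padded))       # ['part0001', 'part0002', 'part0010', 'part0011']

def partnumber(filename):
    """Hypothetical sort key: the integer after the 'part' prefix."""
    return int(filename[len('part'):])

print(sorted(unpadded, key=partnumber))   # ['part1', 'part2', 'part10', 'part11']
```

The numeric key repairs the order for unpadded names, but it still depends on the shared prefix, which is the naming dependency noted above.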

Usage Variations

Finally, let’s run a few more experiments with these Python system utilities to demonstrate other usage modes. When run without full command-line arguments, both split and join are smart enough to input their parameters interactively. Here they are chopping and gluing the Python self-installer file on Windows again, with parameters typed in the DOS console window:

C:\temp> python C:\...\PP4E\System\Filetools\split.py
File to be split? python-3.1.msi
Directory to store part files? splitout
Splitting C:\temp\python-3.1.msi to C:\temp\splitout by 1433600
Split finished: 10 parts are in C:\temp\splitout
Press Enter key

C:\temp> python C:\...\PP4E\System\Filetools\join.py
Directory containing part files? splitout
Name of file to be recreated? newpy31.msi
Joining C:\temp\splitout to make C:\temp\newpy31.msi
Join complete: see C:\temp\newpy31.msi
Press Enter key

C:\temp> fc /B python-3.1.msi newpy31.msi
Comparing files python-3.1.msi and NEWPY31.MSI
FC: no differences encountered

When these program files are double-clicked in a Windows file explorer GUI, they work the same way (there are usually no command-line arguments when they are launched this way). In this mode, absolute path displays help clarify where files really are. Remember, the current working directory is the script’s home directory when clicked like this, so a simple name actually maps to a source code directory; type a full path to make the split files show up somewhere else:

[in a pop-up DOS console box when split.py is clicked]
File to be split? c:\temp\python-3.1.msi
Directory to store part files? c:\temp\parts
Splitting c:\temp\python-3.1.msi to c:\temp\parts by 1433600
Split finished: 10 parts are in c:\temp\parts
Press Enter key

[in a pop-up DOS console box when join.py is clicked]
Directory containing part files? c:\temp\parts
Name of file to be recreated? c:\temp\morepy31.msi
Joining c:\temp\parts to make c:\temp\morepy31.msi
Join complete: see c:\temp\morepy31.msi
Press Enter key

Because these scripts package their core logic in functions, though, it’s just as easy to reuse their code by importing and calling from another Python component (make sure your module import search path includes the directory containing the PP4E root first; the first abbreviated line here is one way to do so):

C:\temp> set PYTHONPATH=C:\...\dev\Examples
C:\temp> python
>>> from PP4E.System.Filetools.split import split
>>> from PP4E.System.Filetools.join  import join
>>>
>>> numparts = split('python-3.1.msi', 'calldir')
>>> numparts
10
>>> join('calldir', 'callpy31.msi')
>>>
>>> import os
>>> os.system('fc /B python-3.1.msi callpy31.msi')
Comparing files python-3.1.msi and CALLPY31.msi
FC: no differences encountered
0

A word about performance: all the split and join tests shown so far process a 13 MB file, but they take less than one second of real wall-clock time to finish on my Windows 7 2GHz Atom processor laptop computer—plenty fast for just about any use I could imagine. Both scripts run just as fast for other reasonable part file sizes, too; here is the splitter chopping up the file into 4MB and 500KB parts:

C:\temp> C:\...\PP4E\System\Filetools\split.py python-3.1.msi tempsplit 4000000
Splitting C:\temp\python-3.1.msi to C:\temp\tempsplit by 4000000
Split finished: 4 parts are in C:\temp\tempsplit

C:\temp> dir tempsplit
...more...
Directory of C:\temp\tempsplit

02/21/2010  01:27 PM    <DIR>          .
02/21/2010  01:27 PM    <DIR>          ..
02/21/2010  01:27 PM         4,000,000 part0001
02/21/2010  01:27 PM         4,000,000 part0002
02/21/2010  01:27 PM         4,000,000 part0003
02/21/2010  01:27 PM         1,814,272 part0004
               4 File(s)     13,814,272 bytes
               2 Dir(s)  188,671,983,616 bytes free

C:\temp> C:\...\PP4E\System\Filetools\split.py python-3.1.msi tempsplit 500000
Splitting C:\temp\python-3.1.msi to C:\temp\tempsplit by 500000
Split finished: 28 parts are in C:\temp\tempsplit

C:\temp> dir tempsplit
...more...
Directory of C:\temp\tempsplit

02/21/2010  01:27 PM    <DIR>          .
02/21/2010  01:27 PM    <DIR>          ..
02/21/2010  01:27 PM           500,000 part0001
02/21/2010  01:27 PM           500,000 part0002
02/21/2010  01:27 PM           500,000 part0003
02/21/2010  01:27 PM           500,000 part0004
02/21/2010  01:27 PM           500,000 part0005
...more lines omitted...
02/21/2010  01:27 PM           500,000 part0024
02/21/2010  01:27 PM           500,000 part0025
02/21/2010  01:27 PM           500,000 part0026
02/21/2010  01:27 PM           500,000 part0027
02/21/2010  01:27 PM           314,272 part0028
              28 File(s)     13,814,272 bytes
               2 Dir(s)  188,671,946,752 bytes free

The split can take noticeably longer to finish, but only if the part file’s size is set small enough to generate thousands of part files—splitting into 1,382 parts works but runs slower (though some machines today are quick enough that you might not notice):

C:\temp> C:\...\PP4E\System\Filetools\split.py python-3.1.msi tempsplit 10000
Splitting C:\temp\python-3.1.msi to C:\temp\tempsplit by 10000
Split finished: 1382 parts are in C:\temp\tempsplit

C:\temp> C:\...\PP4E\System\Filetools\join.py tempsplit manypy31.msi
Joining C:\temp\tempsplit to make C:\temp\manypy31.msi
Join complete: see C:\temp\manypy31.msi

C:\temp> fc /B python-3.1.msi manypy31.msi
Comparing files python-3.1.msi and MANYPY31.MSI
FC: no differences encountered

C:\temp> dir tempsplit
...more...
Directory of C:\temp\tempsplit

02/21/2010  01:40 PM    <DIR>          .
02/21/2010  01:40 PM    <DIR>          ..
02/21/2010  01:39 PM            10,000 part0001
02/21/2010  01:39 PM            10,000 part0002
02/21/2010  01:39 PM            10,000 part0003
02/21/2010  01:39 PM            10,000 part0004
02/21/2010  01:39 PM            10,000 part0005
...over 1,000 lines deleted...
02/21/2010  01:40 PM            10,000 part1378
02/21/2010  01:40 PM            10,000 part1379
02/21/2010  01:40 PM            10,000 part1380
02/21/2010  01:40 PM            10,000 part1381
02/21/2010  01:40 PM             4,272 part1382
            1382 File(s)     13,814,272 bytes
               2 Dir(s)  188,651,008,000 bytes free

Finally, the splitter is also smart enough to create the output directory if it doesn’t yet exist and to clear out any old files there if it does exist; the following, for example, leaves only new files in the output directory. Because the joiner combines whatever files exist in the output directory, this is a nice ergonomic touch. If the output directory were not cleared before each split, it would be too easy to forget that a prior run’s files are still there. Given the target audience for these scripts, they needed to be as forgiving as possible; your user base may vary (though you often shouldn’t assume so).

C:\temp> C:\...\PP4E\System\Filetools\split.py python-3.1.msi tempsplit 5000000
Splitting C:\temp\python-3.1.msi to C:\temp\tempsplit by 5000000
Split finished: 3 parts are in C:\temp\tempsplit

C:\temp> dir tempsplit
...more...
Directory of C:\temp\tempsplit

02/21/2010  01:47 PM    <DIR>          .
02/21/2010  01:47 PM    <DIR>          ..
02/21/2010  01:47 PM         5,000,000 part0001
02/21/2010  01:47 PM         5,000,000 part0002
02/21/2010  01:47 PM         3,814,272 part0003
               3 File(s)     13,814,272 bytes
               2 Dir(s)  188,654,452,736 bytes free
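The create-or-clear behavior shown in this session can be sketched in a few lines. The helper name prepare_output_dir is hypothetical, and the sketch assumes the directory holds only plain files (os.remove does not delete subdirectories); treat it as an illustration of the idea, not as the splitter's exact code.

```python
import os, tempfile

def prepare_output_dir(todir):
    """Create todir if it is missing; otherwise delete the files already in it."""
    if not os.path.exists(todir):
        os.mkdir(todir)
    else:
        for fname in os.listdir(todir):
            os.remove(os.path.join(todir, fname))   # assumes plain files only

# demo in a scratch location
todir = os.path.join(tempfile.mkdtemp(), 'parts')
prepare_output_dir(todir)                       # first call creates the directory
with open(os.path.join(todir, 'part0001'), 'wb') as f:
    f.write(b'stale data')                      # simulate a prior run's output
prepare_output_dir(todir)                       # second call clears it out
remaining = os.listdir(todir)
print(remaining)
```

Clearing the directory is deliberately destructive, so a tool like this is best pointed at a directory used for nothing else.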

Of course, the dilemma that these scripts address might today be more easily addressed by simply buying a bigger memory stick or giving kids their own Internet access. Still, once you catch the scripting bug, you’ll find the ease and flexibility of Python to be powerful and enabling tools, especially for writing custom automation scripts like these. When used well, Python may well become your Swiss Army knife of computing.

Generating Redirection Web Pages

Moving is rarely painless, even in cyberspace. Changing your website’s Internet address can lead to all sorts of confusion. You need to ask known contacts to use the new address and hope that others will eventually stumble onto it themselves. But if you rely on the Internet, moves are bound to generate at least as much confusion as an address change in the real world.

Unfortunately, such site relocations are often unavoidable. Both Internet Service Providers (ISPs) and server machines can come and go over the years. Moreover, some ISPs let their service fall to intolerably low levels; if you are unlucky enough to have signed up with such an ISP, there is not much recourse but to change providers, and that often implies a change of web addresses.^[21]

Imagine, though, that you are an O’Reilly author and have published your website’s address in multiple books sold widely all over the world. What do you do when your ISP’s service level requires a site change? Notifying each of the hundreds of thousands of readers out there isn’t exactly a practical solution.

Probably the best you can do is to leave forwarding instructions at the old site for some reasonably long period of time—the virtual equivalent of a “We’ve Moved” sign in a storefront window. On the Web, such a sign can also send visitors to the new site automatically: simply leave a page at the old site containing a hyperlink to the page’s address at the new site, along with timed auto-relocation specifications. With such forward-link files in place, visitors to the old addresses will be only one click or a few seconds away from reaching the new ones.

That sounds simple enough. But because visitors might try to directly access the address of any file at your old site, you generally need to leave one forward-link file for every old file—HTML pages, images, and so on. Unless your prior server supports auto-redirection (and mine did not), this represents a dilemma. If you happen to enjoy doing lots of mindless typing, you could create each forward-link file by hand. But given that my home site contained over 100 HTML files at the time I wrote this paragraph, the prospect of running one editor session per file was more than enough motivation for an automated solution.

Page Template File

Here’s what I came up with. First of all, I create a general page template text file, shown in Example 6-7, to describe how all the forward-link files should look, with parts to be filled in later.

Example 6-7. PP4E\System\Filetools\template.html

<HTML>
<head>
<META HTTP-EQUIV="Refresh" CONTENT="10; URL=http://$server$/$home$/$file$">
<title>Site Redirection Page: $file$</title>
</head>
<BODY>

<H1>This page has moved</H1>
<P>This page now lives at this address:

<P><A HREF="http://$server$/$home$/$file$">
http://$server$/$home$/$file$</A>

<P>Please click on the new address to jump to this page, and
update any links accordingly.  You will be redirectly shortly.
</P>

<HR>
</BODY></HTML>

To fully understand this template, you have to know something about HTML, a web page description language that we’ll explore in Part IV. But for the purposes of this example, you can ignore most of this file and focus on just the parts surrounded by dollar signs: the strings $server$, $home$, and $file$ are targets to be replaced with real values by global text substitutions. They represent items that vary per site relocation and file.

Page Generator Script

Now, given a page template file, the Python script in Example 6-8 generates all the required forward-link files automatically.

Example 6-8. PP4E\System\Filetools\site-forward.py

"""
################################################################################
Create forward-link pages for relocating a web site.
Generates one page for every existing site html file; upload the generated
files to your old web site.  See ftplib later in the book for ways to run
uploads in scripts either after or during page file creation.
################################################################################
"""

import os
servername   = 'learning-python.com'     # where site is relocating to
homedir      = 'books'                   # where site will be rooted
sitefilesdir = r'C:\temp\public_html'    # where site files live locally
uploaddir    = r'C:\temp\isp-forward'    # where to store forward files
templatename = 'template.html'           # template for generated pages

try:
    os.mkdir(uploaddir)                  # make upload dir if needed
except OSError: pass

template  = open(templatename).read()    # load or import template text
sitefiles = os.listdir(sitefilesdir)     # filenames, no directory prefix

count = 0
for filename in sitefiles:
    if filename.endswith('.html') or filename.endswith('.htm'):
        fwdname = os.path.join(uploaddir, filename)
        print('creating', filename, 'as', fwdname)

        filetext = template.replace('$server$', servername)   # insert text
        filetext = filetext.replace('$home$',   homedir)      # and write
        filetext = filetext.replace('$file$',   filename)     # file varies
        open(fwdname, 'w').write(filetext)
        count += 1

print('Last file =>\n', filetext, sep='')
print('Done:', count, 'forward files created.')

Notice that the template’s text is loaded by reading a file; it would work just as well to code it as an imported Python string variable (e.g., a triple-quoted string in a module file). Also observe that all configuration options are assignments at the top of the script, not command-line arguments; since they change so seldom, it’s convenient to type them just once in the script itself.
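Here is a minimal sketch of the string-variable alternative just mentioned, assuming a shortened stand-in for the template text and a hypothetical helper named fill_template. In practice the string could live in a module file and be imported; the substitution calls are the same ones the script uses.

```python
# Hypothetical, shortened stand-in for template.html, coded as a string
template = """<HTML>
<head>
<META HTTP-EQUIV="Refresh" CONTENT="10; URL=http://$server$/$home$/$file$">
<title>Site Redirection Page: $file$</title>
</head>
<BODY>
<P><A HREF="http://$server$/$home$/$file$">
http://$server$/$home$/$file$</A>
</BODY></HTML>
"""

def fill_template(template, server, home, filename):
    """Return the template text with the three $-delimited targets replaced."""
    text = template.replace('$server$', server)
    text = text.replace('$home$', home)
    return text.replace('$file$', filename)

page = fill_template(template, 'learning-python.com', 'books', 'about-lp.html')
print(page)
```

Packaging the substitutions as a function is just one option; it makes the step easy to test on its own, while the template text remains free to change.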

But the main thing worth noticing here is that this script doesn’t care what the template file looks like at all; it simply performs global substitutions blindly in its text, with a different filename value for each generated file. In fact, we can change the template file any way we like without having to touch the script. Though a fairly simple technique, such a division of labor can be used in all sorts of contexts—generating “makefiles,” form letters, HTML replies from CGI scripts on web servers, and so on. In terms of library tools, the generator script:

Uses os.listdir to step through all the filenames in the site’s directory (glob.glob would work too, but may require stripping directory prefixes from file names)
Uses the string object’s replace method to perform global search-and-replace operations that fill in the $-delimited targets in the template file’s text, and endswith to skip non-HTML files (e.g., images—most browsers won’t know what to do with HTML text in a “.jpg” file)
Uses os.path.join and built-in file objects to write the resulting text out to a forward-link file of the same name in an output directory

The end result is a mirror image of the original website directory, containing only forward-link files generated from the page template. As an added bonus, the generator script can be run on just about any Python platform—I can run it on my Windows laptop (where I’m writing this book), as well as on a Linux server (where my http://learning-python.com domain is hosted). Here it is in action on Windows:

C:\...\PP4E\System\Filetools> python site-forward.py
creating about-lp.html as C:\temp\isp-forward\about-lp.html
creating about-lp1e.html as C:\temp\isp-forward\about-lp1e.html
creating about-lp2e.html as C:\temp\isp-forward\about-lp2e.html
creating about-lp3e.html as C:\temp\isp-forward\about-lp3e.html
creating about-lp4e.html as C:\temp\isp-forward\about-lp4e.html
...many more lines deleted...
creating training.html as C:\temp\isp-forward\training.html
creating whatsnew.html as C:\temp\isp-forward\whatsnew.html
creating whatsold.html as C:\temp\isp-forward\whatsold.html
creating xlate-lp.html as C:\temp\isp-forward\xlate-lp.html
creating zopeoutline.htm as C:\temp\isp-forward\zopeoutline.htm
Last file =>
<HTML>
<head>
<META HTTP-EQUIV="Refresh" CONTENT="10; URL=http://learning-python.com/books/zop
eoutline.htm">
<title>Site Redirection Page: zopeoutline.htm</title>
</head>
<BODY>

<H1>This page has moved</H1>
<P>This page now lives at this address:

<P><A HREF="http://learning-python.com/books/zopeoutline.htm">
http://learning-python.com/books/zopeoutline.htm</A>

<P>Please click on the new address to jump to this page, and
update any links accordingly.  You will be redirectly shortly.
</P>

<HR>
</BODY></HTML>
Done: 124 forward files created.

To verify this script’s output, double-click on any of the output files to see what they look like in a web browser (or run a start command in a DOS console on Windows, e.g., start isp-forward\about-lp4e.html). Figure 6-1 shows what one generated page looks like on my machine.

Figure 6-1. Site-forward output file page

To complete the process, you still need to install the forward links: upload all the generated files in the output directory to your old site’s web directory. If that’s too much to do by hand, too, be sure to see the FTP site upload scripts in Chapter 13 for an automatic way to do that step with Python as well (PP4E\Internet\Ftp\uploadflat.py will do the job). Once you’ve started scripting in earnest, you’ll be amazed at how much manual labor Python can automate. The next section provides another prime example.

A Regression Test Script

Mistakes happen. As we’ve seen, Python provides interfaces to a variety of system services, along with tools for adding others. Example 6-9 shows some of the more commonly used system tools in action. It implements a simple regression test system for Python scripts—it runs each in a directory of Python scripts with provided input and command-line arguments, and compares the output of each run to the prior run’s results. As such, this script can be used as an automated testing system to catch errors introduced by changes in program source files; in a big system, you might not know when a fix is really a bug in disguise.

Example 6-9. PP4E\System\Tester\tester.py

"""
################################################################################
Test a directory of Python scripts, passing command-line arguments,
piping in stdin, and capturing stdout, stderr, and exit status to
detect failures and regressions from prior run outputs.  The subprocess
module spawns and controls streams (much like os.popen3 in Python 2.X),
and is cross-platform.  Streams are always binary bytes in subprocess.
Test inputs, args, outputs, and errors map to files in subdirectories.

This is a command-line script, using command-line arguments for
optional test directory name, and force-generation flag.  While we
could package it as a callable function, the fact that its results
are messages and output files makes a call/return model less useful.

Suggested enhancement: could be extended to allow multiple sets
of command-line arguments and/or inputs per test script, to run a
script multiple times (glob for multiple ".in*" files in Inputs?).
Might also seem simpler to store all test files in same directory
with different extensions, but this could grow large over time.
Could also save both stderr and stdout to Errors on failures, but
I prefer to have expected/actual output in Outputs on regressions.
################################################################################
"""

import os, sys, glob, time
from subprocess import Popen, PIPE

# configuration args
testdir  = sys.argv[1] if len(sys.argv) > 1 else os.curdir
forcegen = len(sys.argv) > 2
print('Start tester:', time.asctime())
print('in', os.path.abspath(testdir))

def verbose(*args):
    print('-'*80)
    for arg in args: print(arg)
def quiet(*args): pass
trace = quiet

# glob scripts to be tested
testpatt  = os.path.join(testdir, 'Scripts', '*.py')
testfiles = glob.glob(testpatt)
testfiles.sort()
trace(os.getcwd(), *testfiles)

numfail = 0
for testpath in testfiles:                      # run all tests in dir
    testname = os.path.basename(testpath)       # strip directory path

    # get input and args
    infile = testname.replace('.py', '.in')
    inpath = os.path.join(testdir, 'Inputs', infile)
    indata = open(inpath, 'rb').read() if os.path.exists(inpath) else b''

    argfile = testname.replace('.py', '.args')
    argpath = os.path.join(testdir, 'Args', argfile)
    argdata = open(argpath).read() if os.path.exists(argpath) else ''

    # locate output and error, scrub prior results
    outfile = testname.replace('.py', '.out')
    outpath = os.path.join(testdir, 'Outputs', outfile)
    outpathbad = outpath + '.bad'
    if os.path.exists(outpathbad): os.remove(outpathbad)

    errfile = testname.replace('.py', '.err')
    errpath = os.path.join(testdir, 'Errors', errfile)
    if os.path.exists(errpath): os.remove(errpath)

    # run test with redirected streams
    pypath = sys.executable
    command = '%s %s %s' % (pypath, testpath, argdata)
    trace(command, indata)

    process = Popen(command, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE)
    outdata, errdata = process.communicate(indata)           # data are bytes; reads
    exitstatus = process.returncode                          # both pipes together, so
    trace(outdata, errdata, exitstatus)                      # full buffers can't block

    # analyze results
    if exitstatus != 0:
        print('ERROR status:', testname, exitstatus)         # status and/or stderr
    if errdata:
        print('ERROR stream:', testname, errpath)            # save error text
        open(errpath, 'wb').write(errdata)

    if exitstatus or errdata:                                # consider both failure
        numfail += 1                                         # can get status+stderr
        open(outpathbad, 'wb').write(outdata)                # save output to view

    elif not os.path.exists(outpath) or forcegen:
        print('generating:', outpath)                        # create first output
        open(outpath, 'wb').write(outdata)

    else:
        priorout = open(outpath, 'rb').read()                # or compare to prior
        if priorout == outdata:
            print('passed:', testname)
        else:
            numfail += 1
            print('FAILED output:', testname, outpathbad)
            open(outpathbad, 'wb').write(outdata)

print('Finished:', time.asctime())
print('%s tests were run, %s tests failed.' % (len(testfiles), numfail))

We’ve seen the tools used by this script earlier in this part of the book: subprocess, os.path, glob, files, and the like. This example largely just pulls these tools together to serve a useful purpose. Its core operation is comparing new outputs to old, in order to spot changes (“regressions”). Along the way, it also manages command-line arguments, error messages, status codes, and files.
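To make that core operation concrete, here is a minimal sketch of the run-and-compare step in isolation. It is not the tester's own code: the function names run_script and classify are hypothetical, the script is passed as source text with Python's -c option rather than read from a Scripts directory, and subprocess.run assumes Python 3.5 or later.

```python
import subprocess
import sys

def run_script(source, stdin_data=b'', args=()):
    """Run Python source text in a child process with redirected streams."""
    command = [sys.executable, '-c', source] + list(args)
    proc = subprocess.run(command, input=stdin_data,
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.stdout, proc.stderr, proc.returncode     # bytes, bytes, int

def classify(expected, outdata, errdata, status):
    """Mirror the tester's decision order: errors first, then comparison."""
    if status != 0 or errdata:
        return 'error'                                   # bad status and/or stderr text
    return 'passed' if outdata == expected else 'regression'

if __name__ == '__main__':
    out, err, status = run_script("import sys; sys.stdout.write('spam')")
    print(classify(b'spam', out, err, status))
```

The comparison is done on bytes, as in the tester, so no assumptions are made about the encoding of a tested script's output.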

This script is also larger than most we’ve seen so far, but it’s a realistic and representative system administration tool (in fact, it’s derived from a similar tool I actually used in the past to detect changes in a compiler). Probably the best way to understand how it works is to demonstrate what it does. The next section steps through a testing session to be read in conjunction with studying the test script’s code.

Running the Test Driver

Much of the magic behind the test driver script in Example 6-9 has to do with its directory structure. When you run it for the first time in a test directory (or force it to start from scratch there by passing a second command-line argument), it:

Collects scripts to be run in the Scripts subdirectory
Fetches any associated script input and command-line arguments from the Inputs and Args subdirectories
Generates initial stdout output files for tests that exit normally in the Outputs subdirectory
Reports tests that fail either by exit status code or by error messages appearing in stderr

On all failures, the script also saves any stderr error message text, as well as any stdout data generated up to the point of failure. Standard error text is saved to a file in the Errors subdirectory, and standard output of failed tests is saved with a special “.bad” filename extension in Outputs (saving it there under the normal “.out” name would make it the expected output, and trigger a failure when the test is later fixed!). Here’s a first run:

C:\...\PP4E\System\Tester> python tester.py . 1
Start tester: Mon Feb 22 22:13:38 2010
in C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Tester
generating: .\Outputs\test-basic-args.out
generating: .\Outputs\test-basic-stdout.out
generating: .\Outputs\test-basic-streams.out
generating: .\Outputs\test-basic-this.out
ERROR status: test-errors-runtime.py 1
ERROR stream: test-errors-runtime.py .\Errors\test-errors-runtime.err
ERROR status: test-errors-syntax.py 1
ERROR stream: test-errors-syntax.py .\Errors\test-errors-syntax.err
ERROR status: test-status-bad.py 42
generating: .\Outputs\test-status-good.out
Finished: Mon Feb 22 22:13:41 2010
8 tests were run, 3 tests failed.

To run each script, the tester configures any preset command-line arguments provided, pipes in fetched canned input (if any), and captures the script’s standard output and error streams, along with its exit status code. When I ran this example, there were 8 test scripts, along with a variety of inputs and outputs. Since the directory and file naming structures are the key to this example, here is a listing of the test directory I used—the Scripts directory is primary, because that’s where tests to be run are collected:

C:\...\PP4E\System\Tester> dir /B
Args
Errors
Inputs
Outputs
Scripts
tester.py
xxold

C:\...\PP4E\System\Tester> dir /B Scripts
test-basic-args.py
test-basic-stdout.py
test-basic-streams.py
test-basic-this.py
test-errors-runtime.py
test-errors-syntax.py
test-status-bad.py
test-status-good.py

The other subdirectories contain any required inputs and any generated outputs associated with scripts to be tested:

C:\...\PP4E\System\Tester> dir /B Args
test-basic-args.args
test-status-good.args

C:\...\PP4E\System\Tester> dir /B Inputs
test-basic-args.in
test-basic-streams.in

C:\...\PP4E\System\Tester> dir /B Outputs
test-basic-args.out
test-basic-stdout.out
test-basic-streams.out
test-basic-this.out
test-errors-runtime.out.bad
test-errors-syntax.out.bad
test-status-bad.out.bad
test-status-good.out

C:\...\PP4E\System\Tester> dir /B Errors
test-errors-runtime.err
test-errors-syntax.err
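The naming convention visible in these listings can be summarized in a few lines of code. The following is only a sketch: the related_paths function is a hypothetical helper, and it uses os.path.splitext to strip the extension where the tester itself uses string replacement.

```python
import os

def related_paths(testdir, testname):
    """Map a script name in Scripts to the files the tester associates with it."""
    base = os.path.splitext(testname)[0]     # 'test-basic-args.py' -> 'test-basic-args'
    return {
        'script': os.path.join(testdir, 'Scripts', testname),
        'input':  os.path.join(testdir, 'Inputs',  base + '.in'),
        'args':   os.path.join(testdir, 'Args',    base + '.args'),
        'output': os.path.join(testdir, 'Outputs', base + '.out'),
        'bad':    os.path.join(testdir, 'Outputs', base + '.out.bad'),
        'error':  os.path.join(testdir, 'Errors',  base + '.err'),
    }

if __name__ == '__main__':
    for kind, path in related_paths('.', 'test-basic-args.py').items():
        print(kind, '=>', path)
```

Only the Scripts entry is required; the input and arguments files are optional, and the output and error files are created by the tester as needed.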

I won’t list all these files here (as you can see, there are many, and all are available in the book examples distribution package), but to give you the general flavor, here are the files associated with the test script test-basic-args.py:

C:\...\PP4E\System\Tester> type Scripts\test-basic-args.py
# test args, streams
import sys, os
print(os.getcwd())                  # to Outputs
print(sys.path[0])

print('[argv]')
for arg in sys.argv:                # from Args
    print(arg)                      # to Outputs

print('[interaction]')              # to Outputs
text = input('Enter text:')         # from Inputs
rept = sys.stdin.readline()         # from Inputs
sys.stdout.write(text * int(rept))  # to Outputs

C:\...\PP4E\System\Tester> type Args\test-basic-args.args
-command -line --stuff

C:\...\PP4E\System\Tester> type Inputs\test-basic-args.in
Eggs
10

C:\...\PP4E\System\Tester> type Outputs\test-basic-args.out
C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Tester
C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Tester\Scripts
[argv]
.\Scripts\test-basic-args.py
-command
-line
--stuff
[interaction]
Enter text:EggsEggsEggsEggsEggsEggsEggsEggsEggsEggs

And here are two files related to one of the detected errors—the first is its captured stderr, and the second is its stdout generated up to the point where the error occurred; these are for human (or other tools) inspection, and are automatically removed the next time the tester script runs:

C:\...\PP4E\System\Tester> type Errors\test-errors-runtime.err
Traceback (most recent call last):
  File ".\Scripts\test-errors-runtime.py", line 3, in <module>
    print(1 / 0)
ZeroDivisionError: int division or modulo by zero

C:\...\PP4E\System\Tester> type Outputs\test-errors-runtime.out.bad
starting

Now, when run again without making any changes to the tests, the test driver script compares saved prior outputs to new ones and detects no regressions; failures designated by exit status and stderr messages are still reported as before, but there are no deviations from other tests’ saved expected output:

C:\...\PP4E\System\Tester> python tester.py
Start tester: Mon Feb 22 22:26:41 2010
in C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Tester
passed: test-basic-args.py
passed: test-basic-stdout.py
passed: test-basic-streams.py
passed: test-basic-this.py
ERROR status: test-errors-runtime.py 1
ERROR stream: test-errors-runtime.py .\Errors\test-errors-runtime.err
ERROR status: test-errors-syntax.py 1
ERROR stream: test-errors-syntax.py .\Errors\test-errors-syntax.err
ERROR status: test-status-bad.py 42
passed: test-status-good.py
Finished: Mon Feb 22 22:26:43 2010
8 tests were run, 3 tests failed.

But when I make a change in one of the test scripts that will produce different output (I changed a loop counter to print fewer lines), the regression is caught and reported; the new and different output of the script is reported as a failure, and saved in Outputs as a “.bad” for later viewing:

C:\...\PP4E\System\Tester> python tester.py
Start tester: Mon Feb 22 22:28:35 2010
in C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Tester
passed: test-basic-args.py
FAILED output: test-basic-stdout.py .\Outputs\test-basic-stdout.out.bad
passed: test-basic-streams.py
passed: test-basic-this.py
ERROR status: test-errors-runtime.py 1
ERROR stream: test-errors-runtime.py .\Errors\test-errors-runtime.err
ERROR status: test-errors-syntax.py 1
ERROR stream: test-errors-syntax.py .\Errors\test-errors-syntax.err
ERROR status: test-status-bad.py 42
passed: test-status-good.py
Finished: Mon Feb 22 22:28:38 2010
8 tests were run, 4 tests failed.

C:\...\PP4E\System\Tester> type Outputs\test-basic-stdout.out.bad
begin
Spam!
Spam!Spam!
Spam!Spam!Spam!
Spam!Spam!Spam!Spam!
end
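The tester only saves the differing output; deciding what changed is left to you. One way to inspect a “.bad” file against its expected counterpart, offered here as a sketch rather than as part of the tester, is the standard library's difflib module. The show_regression name and the sample byte strings below are hypothetical.

```python
import difflib

def show_regression(expected, actual, name='output'):
    """Return unified-diff lines between expected and actual output bytes."""
    old = expected.decode('utf-8', errors='replace').splitlines()
    new = actual.decode('utf-8', errors='replace').splitlines()
    return list(difflib.unified_diff(old, new, name + '.out', name + '.out.bad',
                                     lineterm=''))

if __name__ == '__main__':
    expected = b'begin\nSpam!\nSpam!Spam!\nend\n'       # hypothetical saved output
    actual   = b'begin\nSpam!\nend\n'                   # hypothetical new output
    for line in show_regression(expected, actual, 'demo'):
        print(line)
```

Lines prefixed with a minus sign appear only in the expected output, and lines prefixed with a plus sign appear only in the new output; identical inputs produce an empty list.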

One last usage note: if you change the trace variable in this script to be verbose, you’ll get much more output designed to help you trace the program’s operation (but probably too much for real testing runs):

C:\...\PP4E\System\Tester> tester.py
Start tester: Mon Feb 22 22:34:51 2010
in C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Tester
--------------------------------------------------------------------------------
C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Tester
.\Scripts\test-basic-args.py
.\Scripts\test-basic-stdout.py
.\Scripts\test-basic-streams.py
.\Scripts\test-basic-this.py
.\Scripts\test-errors-runtime.py
.\Scripts\test-errors-syntax.py
.\Scripts\test-status-bad.py
.\Scripts\test-status-good.py
--------------------------------------------------------------------------------
C:\Python31\python.exe .\Scripts\test-basic-args.py -command -line --stuff
b'Eggs\r\n10\r\n'
--------------------------------------------------------------------------------
b'C:\\Users\\mark\\Stuff\\Books\\4E\\PP4E\\dev\\Examples\\PP4E\\System\\Tester\r\n
C:\\Users\\mark\\Stuff\\Books\\4E\\PP4E\\dev\\Examples\\PP4E\\System\\Tester\\
Scripts\r\n[argv]\r\n.\\Scripts\\test-basic-args.py\r\n-command\r\n-line\r\n--st
uff\r\n[interaction]\r\nEnter text:EggsEggsEggsEggsEggsEggsEggsEggsEggsEggs'
b''
0
passed: test-basic-args.py
...more lines deleted...

Study the test driver’s code for more details. Naturally, there is much more to the general testing story than we have space for here. For example, in-process tests don’t need to spawn programs and can generally make do with importing modules and testing them inside try statements that catch exceptions. There is also ample room for expansion and customization in our testing script (see its docstring for starters). Moreover, Python comes with two testing frameworks, doctest and unittest (a.k.a. PyUnit), which provide techniques and structures for coding regression and unit tests:

unittest: An object-oriented framework that specifies test cases, expected results, and test suites. Subclasses provide test methods and use inherited assertion calls to specify expected results.
doctest: Parses out and reruns tests from an interactive session log that is pasted into a module’s docstrings. The logs give test calls and expected results; doctest essentially reruns the interactive session.
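To give a feel for the two frameworks just described, here is a small sketch that tests one function both ways. The repeat function, the RepeatTest class, and the two run_ helpers are all hypothetical names invented for this illustration; the helpers exist only so that results can be inspected programmatically.

```python
import doctest
import io
import unittest

def repeat(text, count):
    """
    Return text repeated count times.

    >>> repeat('Eggs', 3)
    'EggsEggsEggs'
    """
    return text * count

class RepeatTest(unittest.TestCase):
    def test_repeats(self):
        self.assertEqual(repeat('Spam!', 2), 'Spam!Spam!')

    def test_zero(self):
        self.assertEqual(repeat('Spam!', 0), '')

def run_unittest():
    """Run the TestCase above; return True if every test method passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RepeatTest)
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return result.wasSuccessful()

def run_doctest():
    """Rerun the session pasted in repeat's docstring; return (failed, attempted)."""
    test = doctest.DocTestParser().get_doctest(
        repeat.__doc__, {'repeat': repeat}, 'repeat', None, 0)
    results = doctest.DocTestRunner(verbose=False).run(test)
    return results.failed, results.attempted

if __name__ == '__main__':
    print(run_unittest(), run_doctest())
```

In a typical script you would more likely call unittest.main() or doctest.testmod() at the bottom of the file and let the framework find the tests itself.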

See the Python library manual, the PyPI website, and your favorite Web search engine for additional testing toolkits in both Python itself and the third-party domain.

For automated testing of Python command-line scripts that run as independent programs and tap into standard script execution context, though, our tester does the job. Because the test driver is fully independent of the scripts it tests, we can drop in new test cases without having to update the driver’s code. And because it is written in Python, it’s quick and easy to change as our testing needs evolve. As we’ll see again in the next section, this “scriptability” that Python provides can be a decided advantage for real tasks.

Once we learn about sending email from Python scripts in Chapter 13, you might also want to augment this script to automatically send out email when regularly run tests fail (e.g., when run from a cron job on Unix). That way, you don’t even need to remember to check results. Of course, you could go further still.

One company I worked for added sound effects to compiler test scripts; you got an audible round of applause if no regressions were found and an entirely different noise otherwise. (See playfile.py at the end of this chapter for hints.)

Another company in my development past ran a nightly test script that automatically isolated the source code file check-in that triggered a test regression and sent a nasty email to the guilty party (and his or her supervisor). Nobody expects the Spanish Inquisition!

Copying Directory Trees

My CD writer sometimes does weird things. In fact, copies of files with odd names can be totally botched on the CD, even though other files show up in one piece. That’s not necessarily a showstopper; if just a few files are trashed in a big CD backup copy, I can always copy the offending files elsewhere one at a time. Unfortunately, drag-and-drop copies on some versions of Windows don’t play nicely with such a CD: the copy operation stops and exits the moment the first bad file is encountered. You get only as many files as were copied up to the error, but no more.

In fact, this is not limited to CD copies. I’ve run into similar problems when trying to back up my laptop’s hard drive to another drive—the drag-and-drop copy stops with an error as soon as it reaches a file with a name that is too long or odd to copy (common in saved web pages). The last 30 minutes spent copying is wasted time; frustrating, to say the least!

There may be some magical Windows setting to work around this feature, but I gave up hunting for one as soon as I realized that it would be easier to code a copier in Python. The cpall.py script in Example 6-10 is one way to do it. With this script, I control what happens when bad files are found—I can skip over them with Python exception handlers, for instance. Moreover, this tool works with the same interface and effect on other platforms. It seems to me, at least, that a few minutes spent writing a portable and reusable Python script to meet a need is a better investment than looking for solutions that work on only one platform (if at all).

Example 6-10. PP4E\System\Filetools\cpall.py

"""
################################################################################
Usage: "python cpall.py dirFrom dirTo".
Recursive copy of a directory tree.  Works like a "cp -r dirFrom/* dirTo"
Unix command, and assumes that dirFrom and dirTo are both directories.
Was written to get around fatal error messages under Windows drag-and-drop
copies (the first bad file ends the entire copy operation immediately),
but also allows for coding more customized copy operations in Python.
################################################################################
"""

import os, sys
maxfileload = 1000000
blksize = 1024 * 500

def copyfile(pathFrom, pathTo, maxfileload=maxfileload):
    """
    Copy one file pathFrom to pathTo, byte for byte;
    uses binary file modes to suppress Unicode decode and endline transform
    """
    if os.path.getsize(pathFrom) <= maxfileload:
        bytesFrom = open(pathFrom, 'rb').read()   # read small file all at once
        open(pathTo, 'wb').write(bytesFrom)
    else:
        fileFrom = open(pathFrom, 'rb')           # read big files in chunks
        fileTo   = open(pathTo,   'wb')           # need b mode for both
        while True:
            bytesFrom = fileFrom.read(blksize)    # get one block, less at end
            if not bytesFrom: break               # empty after last chunk
            fileTo.write(bytesFrom)

def copytree(dirFrom, dirTo, verbose=0):
    """
    Copy contents of dirFrom and below to dirTo, return (files, dirs) counts;
    may need to use bytes for dirnames if undecodable on other platforms;
    may need to do more file type checking on Unix: skip links, fifos, etc.
    """
    fcount = dcount = 0
    for filename in os.listdir(dirFrom):                  # for files/dirs here
        pathFrom = os.path.join(dirFrom, filename)
        pathTo   = os.path.join(dirTo,   filename)        # extend both paths
        if not os.path.isdir(pathFrom):                   # copy simple files
            try:
                if verbose > 1: print('copying', pathFrom, 'to', pathTo)
                copyfile(pathFrom, pathTo)
                fcount += 1
            except:
                print('Error copying', pathFrom, 'to', pathTo, '--skipped')
                print(sys.exc_info()[0], sys.exc_info()[1])
        else:
            if verbose: print('copying dir', pathFrom, 'to', pathTo)
            try:
                os.mkdir(pathTo)                          # make new subdir
                below = copytree(pathFrom, pathTo)        # recur into subdirs
                fcount += below[0]                        # add subdir  counts
                dcount += below[1]
                dcount += 1
            except:
                print('Error creating', pathTo, '--skipped')
                print(sys.exc_info()[0], sys.exc_info()[1])
    return (fcount, dcount)

def getargs():
    """
    Get and verify directory name arguments, returns default None on errors
    """
    try:
        dirFrom, dirTo = sys.argv[1:]
    except:
        print('Usage error: cpall.py dirFrom dirTo')
    else:
        if not os.path.isdir(dirFrom):
            print('Error: dirFrom is not a directory')
        elif not os.path.exists(dirTo):
            os.mkdir(dirTo)
            print('Note: dirTo was created')
            return (dirFrom, dirTo)
        else:
            print('Warning: dirTo already exists')
            if hasattr(os.path, 'samefile'):
                same = os.path.samefile(dirFrom, dirTo)
            else:
                same = os.path.abspath(dirFrom) == os.path.abspath(dirTo)
            if same:
                print('Error: dirFrom same as dirTo')
            else:
                return (dirFrom, dirTo)

if __name__ == '__main__':
    import time
    dirstuple = getargs()
    if dirstuple:
        print('Copying...')
        start = time.perf_counter()            # time.clock was removed in Python 3.8
        fcount, dcount = copytree(*dirstuple)
        print('Copied', fcount, 'files,', dcount, 'directories', end=' ')
        print('in', time.perf_counter() - start, 'seconds')

This script implements its own recursive tree traversal logic and keeps track of both the “from” and “to” directory paths as it goes. At every level, it copies over simple files, creates directories in the “to” path, and recurs into subdirectories with “from” and “to” paths extended by one level. There are other ways to code this task (e.g., we might change the working directory along the way with os.chdir calls, or use os.walk and replace the “from” path prefix with the “to” prefix at each directory), but extending paths on recursive descent works well in this script.
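For comparison, here is a rough sketch of what the os.walk alternative might look like. It is an assumption-laden variation, not the book's script: the copytree_walk name is hypothetical, it leans on shutil.copyfile for the per-file copy, and it assumes the target directory is not located inside the source tree.

```python
import os
import shutil

def copytree_walk(dir_from, dir_to):
    """Copy dir_from's contents into dir_to using os.walk; return (files, dirs)."""
    fcount = dcount = 0
    for this_dir, subdirs, files in os.walk(dir_from):
        rel = os.path.relpath(this_dir, dir_from)         # '.' at the top level
        target = os.path.normpath(os.path.join(dir_to, rel))
        os.makedirs(target, exist_ok=True)                # swap "from" prefix for "to"
        if rel != os.curdir:
            dcount += 1                                   # count subdirectories only
        for name in files:
            try:
                shutil.copyfile(os.path.join(this_dir, name),
                                os.path.join(target, name))
                fcount += 1
            except OSError as exc:                        # skip bad files, keep going
                print('Error copying', name, '--skipped:', exc)
    return fcount, dcount
```

The walker hands us each “from” directory in turn, so the only bookkeeping left is computing the matching “to” directory from the relative path.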

Notice this script’s reusable copyfile function: just in case there are multigigabyte files in the tree to be copied, it uses a file’s size to decide whether it should be read all at once or in chunks (remember, the file read method without arguments loads the entire file into memory, here as a single bytes object because the file is opened in binary mode). We choose fairly large file and block sizes, because the more we read at once in Python, the faster our scripts will typically run. Reading in chunks is more efficient than it may sound; objects left behind by prior reads will be garbage collected and their memory reused as we go. We’re using binary file modes here again, too, to suppress the Unicode encodings and end-of-line translations of text files, since trees may contain arbitrary kinds of files.
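The chunked branch of copyfile can also be written with context managers, so that both files are closed promptly even if a read or write fails. This is a sketch of that variation only; the copyfile_chunked name is hypothetical, and the default block size simply mirrors the script's blksize setting.

```python
from functools import partial

def copyfile_chunked(path_from, path_to, blksize=1024 * 500):
    """Copy one file byte for byte, reading at most blksize bytes at a time."""
    with open(path_from, 'rb') as file_from, open(path_to, 'wb') as file_to:
        # iter() calls read(blksize) repeatedly until it returns b'' at end of file
        for block in iter(partial(file_from.read, blksize), b''):
            file_to.write(block)
```

The standard library's shutil.copyfileobj function performs a similar block-by-block copy between two open file objects, if you would rather not code the loop yourself.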

Also notice that this script creates the “to” directory if needed, but it assumes that the directory is empty when a copy starts up; for accuracy, be sure to remove the target directory before copying a new tree to its name, or old files may linger in the target tree (we could automatically remove the target first, but this may not always be desired). This script also tries to determine if the source and target are the same; on Unix-like platforms with oddities such as links, os.path.samefile does a more accurate job than comparing absolute file names (different file names may be the same file).
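The same-directory test described above can be isolated in a small function. This sketch restates the logic in getargs under a hypothetical name, same_directory; both paths are assumed to exist when os.path.samefile is available, because that call inspects the actual files.

```python
import os

def same_directory(dir1, dir2):
    """Return True if both names refer to the same directory."""
    if hasattr(os.path, 'samefile'):
        return os.path.samefile(dir1, dir2)       # compares the underlying file identity
    return os.path.abspath(dir1) == os.path.abspath(dir2)   # fallback: compare names
```

Because the first branch looks at the files themselves rather than their names, it also recognizes a directory reached through a link or through a roundabout path such as one containing “..” components.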

Here is a copy of a big book examples tree (I use the tree from the prior edition throughout this chapter) in action on Windows; pass in the name of the “from” and “to” directories to kick off the process, redirect the output to a file if there are too many error messages to read all at once (e.g., > output.txt), and run an rm -r or rmdir /S shell command (or similar platform-specific tool) to delete the target directory first if needed:

C:\...\PP4E\System\Filetools> rmdir /S copytemp
copytemp, Are you sure (Y/N)? y

C:\...\PP4E\System\Filetools> cpall.py C:\temp\PP3E\Examples copytemp
Note: dirTo was created
Copying...
Copied 1430 files, 185 directories in 10.4470980971 seconds

C:\...\PP4E\System\Filetools> fc /B copytemp\PP3E\Launcher.py
                                    C:\temp\PP3E\Examples\PP3E\Launcher.py
Comparing files COPYTEMP\PP3E\Launcher.py and C:\TEMP\PP3E\EXAMPLES\PP3E\LAUNCHER.PY
FC: no differences encountered

You can use the copytree function’s verbose argument to trace the process if you wish. At the time I wrote this edition in 2010, this test run copied a tree of 1,430 files and 185 directories in 10 seconds on my woefully underpowered netbook machine (a timer call in the built-in time module is used to measure elapsed seconds); it may run arbitrarily faster or slower for you. Still, this is at least as fast as the best drag-and-drop I’ve timed on this machine.

So how does this script work around bad files on a CD backup? The secret is that it catches and ignores file exceptions, and it keeps walking. To copy all the files that are good on a CD, I simply run a command line such as this one:

C:\...\PP4E\System\Filetools> python cpall.py G:\Examples C:\PP3E\Examples

Because the CD is addressed as “G:” on my Windows machine, this is the command-line equivalent of drag-and-drop copying from an item in the CD’s top-level folder, except that the Python script will recover from errors on the CD and get the rest. On copy errors, it prints a message to standard output and continues; for big copies, you’ll probably want to redirect the script’s output to a file for later inspection.
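The skip-and-continue idea is independent of tree walking, and can be sketched on its own. The copy_all function below is a hypothetical helper, not part of cpall: it collects failures in a list instead of printing them, which is one way to support later inspection without redirecting output.

```python
import shutil

def copy_all(pairs):
    """Copy each (source, target) pair; return a list of (source, error) failures."""
    failures = []
    for source, target in pairs:
        try:
            shutil.copyfile(source, target)
        except OSError as exc:                  # unreadable, missing, or badly named file
            failures.append((source, exc))      # record it and keep going
    return failures
```

A caller can report the returned list at the end of a run, or retry just the failed files one at a time.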

In general, cpall can be passed any directory path on your machine, even those that indicate devices such as CDs. To make this go on Linux, pass the directory where the CD is mounted (often somewhere under /mnt or /media, depending on your system) rather than a device file such as /dev/cdrom. Once you’ve copied a tree this way, you still might want to verify; to see how, let’s move on to the next example.

Comparing Directory Trees

Engineers can be a paranoid sort (but you didn’t hear that from me). At least I am. It comes from decades of seeing things go terribly wrong, I suppose. When I create a CD backup of my hard drive, for instance, there’s still something a bit too magical about the process to trust the CD writer program to do the right thing. Maybe I should, but it’s tough to have a lot of faith in tools that occasionally trash files and seem to crash my Windows machine every third Tuesday of the month. When push comes to shove, it’s nice to be able to verify that data copied to a backup CD is the same as the original—or at least to spot deviations from the original—as soon as possible. If a backup is ever needed, it will be really needed.

Because data CDs are accessible as simple directory trees in the file system, we are once again in the realm of tree walkers—to verify a backup CD, we simply need to walk its top-level directory. If our script is general enough, we will also be able to use it to verify other copy operations as well—e.g., downloaded tar files, hard-drive backups, and so on. In fact, the combination of the cpall script of the prior section and a general tree comparison would provide a portable and scriptable way to copy and verify data sets.

We’ve already studied generic directory tree walkers, but they won’t help us here directly: we need to walk two directories in parallel and inspect common files along the way. Moreover, walking either one of the two directories won’t allow us to spot files and directories that exist only in the other. Something more custom and recursive seems in order here.

Finding Directory Differences

Before we start coding, the first thing we need to clarify is what it means to compare two directory trees. If both trees have exactly the same branch structure and depth, this problem reduces to comparing corresponding files in each tree. In general, though, the trees can have arbitrarily different shapes, depths, and so on.

More generally, the contents of a directory in one tree may have more or fewer entries than the corresponding directory in the other tree. If those differing contents are filenames, there is no corresponding file to compare with; if they are directory names, there is no corresponding branch to descend through. In fact, the only way to detect files and directories that appear in one tree but not the other is to detect differences in each level’s directory.

In other words, a tree comparison algorithm will also have to perform directory comparisons along the way. Because this is a nested and simpler operation, let’s start by coding and debugging a single-directory comparison of filenames in Example 6-11.

Example 6-11. PP4E\System\Filetools\dirdiff.py

"""
################################################################################
Usage: python dirdiff.py dir1-path dir2-path
Compare two directories to find files that exist in one but not the other.
This version uses the os.listdir function and list difference.  Note that
this script checks only filenames, not file contents--see diffall.py for an
extension that does the latter by comparing .read() results.
################################################################################
"""

import os, sys

def reportdiffs(unique1, unique2, dir1, dir2):
    """
    Generate diffs report for one dir: part of comparedirs output
    """
    if not (unique1 or unique2):
        print('Directory lists are identical')
    else:
        if unique1:
            print('Files unique to', dir1)
            for file in unique1:
                print('...', file)
        if unique2:
            print('Files unique to', dir2)
            for file in unique2:
                print('...', file)

def difference(seq1, seq2):
    """
    Return all items in seq1 only;
    a set(seq1) - set(seq2) would work too, but sets are randomly
    ordered, so any platform-dependent directory order would be lost
    """
    return [item for item in seq1 if item not in seq2]


def comparedirs(dir1, dir2, files1=None, files2=None):
    """
    Compare directory contents, but not actual files;
    may need bytes listdir arg for undecodable filenames on some platforms
    """
    print('Comparing', dir1, 'to', dir2)
    files1  = os.listdir(dir1) if files1 is None else files1
    files2  = os.listdir(dir2) if files2 is None else files2
    unique1 = difference(files1, files2)
    unique2 = difference(files2, files1)
    reportdiffs(unique1, unique2, dir1, dir2)
    return not (unique1 or unique2)               # true if no diffs

def getargs():
    "Args for command-line mode"
    try:
        dir1, dir2 = sys.argv[1:]                 # 2 command-line args
    except:
        print('Usage: dirdiff.py dir1 dir2')
        sys.exit(1)
    else:
        return (dir1, dir2)

if __name__ == '__main__':
    dir1, dir2 = getargs()
    comparedirs(dir1, dir2)

Given listings of names in two directories, this script simply picks out unique names in the first and unique names in the second, and reports any unique names found as differences (that is, files in one directory but not the other). Its comparedirs function returns a true result if no differences were found, which is useful for detecting differences in callers.
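The list logic can be tried without touching a real directory. In this sketch, difference is restated from Example 6-11, while compare_listings and the two sample listings are hypothetical stand-ins for comparedirs and os.listdir results.

```python
def difference(seq1, seq2):
    """Return items that appear in seq1 only, in seq1's order."""
    return [item for item in seq1 if item not in seq2]

def compare_listings(names1, names2):
    """Return (unique1, unique2, same) for two lists of names."""
    unique1 = difference(names1, names2)
    unique2 = difference(names2, names1)
    return unique1, unique2, not (unique1 or unique2)     # same is True if no diffs

if __name__ == '__main__':
    old = ['App', 'Exits', 'more.py', 'spam.txt']         # hypothetical listings
    new = ['more.py', 'spam.txt', 'Tester']
    print(compare_listings(old, new))   # (['App', 'Exits'], ['Tester'], False)
```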

Let’s run this script on a few directories; differences are detected and reported as names unique in either passed-in directory pathname. Notice that this is only a structural comparison that just checks names in listings, not file contents (we’ll add the latter in a moment):

C:\...\PP4E\System\Filetools> dirdiff.py C:\temp\PP3E\Examples copytemp
Comparing C:\temp\PP3E\Examples to copytemp
Directory lists are identical

C:\...\PP4E\System\Filetools> dirdiff.py C:\temp\PP3E\Examples\PP3E\System ..
Comparing C:\temp\PP3E\Examples\PP3E\System to ..
Files unique to C:\temp\PP3E\Examples\PP3E\System
... App
... Exits
... Media
... moreplus.py
Files unique to ..
... more.pyc
... spam.txt
... Tester
... __init__.pyc

The difference function is the heart of this script: it performs a simple list difference operation. When applied to directories, unique items represent tree differences, and common items are names of files or subdirectories that merit further comparisons or traversals. In fact, in Python 2.4 and later, we could also use the built-in set object type if we don’t care about the order in the results: because sets are unordered, they would not maintain any original and possibly platform-specific left-to-right order of the directory listings provided by os.listdir. For that reason (and to avoid requiring users to upgrade), we’ll keep using our own comprehension-based function instead of sets.
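There is also a middle ground between the two approaches, sketched here under a hypothetical name, difference_fast: use a set only for the membership tests, and keep the list comprehension for the result. This preserves the listing order while avoiding a scan of the second list for every item in the first, which may matter for very large directories.

```python
def difference_fast(seq1, seq2):
    """Items in seq1 only, keeping seq1's order; membership tests use a set."""
    exclude = set(seq2)                      # built once; lookups are fast on average
    return [item for item in seq1 if item not in exclude]

if __name__ == '__main__':
    print(difference_fast(['b.txt', 'a.txt', 'c.txt'], ['a.txt']))   # ['b.txt', 'c.txt']
```

For directories of ordinary size the difference in speed is unlikely to be noticeable, so the simpler version in the script is a reasonable choice.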

Finding Tree Differences

We’ve just coded a directory comparison tool that picks out unique files and directories. Now all we need is a tree walker that applies dirdiff at each level to report unique items, explicitly compares the contents of files in common, and descends through directories in common. Example 6-12 fits the bill.

Example 6-12. PP4E\System\Filetools\diffall.py

"""
################################################################################
Usage: "python diffall.py dir1 dir2".
Recursive directory tree comparison: report unique files that exist in only
dir1 or dir2, report files of the same name in dir1 and dir2 with differing
contents, report instances of same name but different type in dir1 and dir2,
and do the same for all subdirectories of the same names in and below dir1
and dir2.  A summary of diffs appears at end of output, but search redirected
output for "DIFF" and "unique" strings for further details.  New: (3E) limit
reads to 1M for large files, (3E) catch same name=file/dir, (4E) avoid extra
os.listdir() calls in dirdiff.comparedirs() by passing results here along.
################################################################################
"""

import os, dirdiff
blocksize = 1024 * 1024              # up to 1M per read

def intersect(seq1, seq2):
    """
    Return all items in both seq1 and seq2;
    a set(seq1) & set(seq2) would work too, but sets are randomly
    ordered, so any platform-dependent directory order would be lost
    """
    return [item for item in seq1 if item in seq2]

def comparetrees(dir1, dir2, diffs, verbose=False):
    """
    Compare all subdirectories and files in two directory trees;
    uses binary files to prevent Unicode decoding and endline transforms,
    as trees might contain arbitrary binary files as well as arbitrary text;
    may need bytes listdir arg for undecodable filenames on some platforms
    """
    # compare file name lists
    print('-' * 20)
    names1 = os.listdir(dir1)
    names2 = os.listdir(dir2)
    if not dirdiff.comparedirs(dir1, dir2, names1, names2):
        diffs.append('unique files at %s - %s' % (dir1, dir2))

    print('Comparing contents')
    common = intersect(names1, names2)
    missed = common[:]

    # compare contents of files in common
    for name in common:
        path1 = os.path.join(dir1, name)
        path2 = os.path.join(dir2, name)
        if os.path.isfile(path1) and os.path.isfile(path2):
            missed.remove(name)
            file1 = open(path1, 'rb')
            file2 = open(path2, 'rb')
            while True:
                bytes1 = file1.read(blocksize)
                bytes2 = file2.read(blocksize)
                if (not bytes1) and (not bytes2):
                    if verbose: print(name, 'matches')
                    break
                if bytes1 != bytes2:
                    diffs.append('files differ at %s - %s' % (path1, path2))
                    print(name, 'DIFFERS')
                    break

    # recur to compare directories in common
    for name in common:
        path1 = os.path.join(dir1, name)
        path2 = os.path.join(dir2, name)
        if os.path.isdir(path1) and os.path.isdir(path2):
            missed.remove(name)
            comparetrees(path1, path2, diffs, verbose)

    # same name but not both files or dirs?
    for name in missed:
        diffs.append('files missed at %s - %s: %s' % (dir1, dir2, name))
        print(name, 'DIFFERS')


if __name__ == '__main__':
    dir1, dir2 = dirdiff.getargs()
    diffs = []
    comparetrees(dir1, dir2, diffs, True)      # changes diffs in-place
    print('=' * 40)                            # walk, report diffs list
    if not diffs:
        print('No diffs found.')
    else:
        print('Diffs found:', len(diffs))
        for diff in diffs: print('-', diff)

At each directory in the tree, this script simply runs the dirdiff tool to detect unique names, and then compares names in common by intersecting directory lists. It uses recursive function calls to traverse the tree and visits subdirectories only after comparing all the files at each level so that the output is more coherent to read (the trace output for subdirectories appears after that for files; it is not intermixed).

Notice the missed list, added in the third edition of this book; it’s very unlikely, but not impossible, that the same name might be a file in one directory and a subdirectory in the other. Also notice the blocksize variable; much like the tree copy script we saw earlier, instead of blindly reading entire files into memory all at once, we limit each read to grab up to 1 MB at a time, just in case any files in the directories are too big to be loaded into available memory. Without this limit, I ran into MemoryError exceptions on some machines with a prior version of this script that read both files all at once, like this:

    bytes1 = open(path1, 'rb').read()
    bytes2 = open(path2, 'rb').read()
    if bytes1 == bytes2: ...

This code was simpler, but is less practical for very large files that can’t fit into your available memory space (consider CD and DVD image files, for example). In the new version’s loop, the file reads return what is left when there is less than 1 MB present or remaining and return empty strings at end-of-file. Files match if all blocks read are the same, and they reach end-of-file at the same time.
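The block-wise loop can also be packaged as a standalone function. The following is only a minimal sketch and not part of diffall.py: the name same_contents and the use of a with statement to close both files are assumptions added for illustration.

```python
def same_contents(path1, path2, blocksize=1024 * 1024):
    """
    Return True if two files hold identical bytes, reading at most
    blocksize bytes per call (hypothetical helper, not in diffall.py)
    """
    with open(path1, 'rb') as file1, open(path2, 'rb') as file2:
        while True:
            bytes1 = file1.read(blocksize)
            bytes2 = file2.read(blocksize)
            if bytes1 != bytes2:
                return False          # blocks differ, or one file ended early
            if not bytes1:
                return True           # both reached end-of-file together
```

Returning on the first mismatching block means the rest of two large files is never read once a difference is known.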

We’re also dealing in binary files and byte strings again to suppress Unicode decoding and end-line translations for file content, because trees may contain arbitrary binary and text files. The usual note about changing this to pass byte strings to os.listdir on platforms where filenames may generate Unicode decoding errors applies here as well (e.g. pass dir1.encode()). On some platforms, you may also want to detect and skip certain kinds of special files in order to be fully general, but these were not in my trees, so they are not in my script.

One minor change for the fourth edition of this book: os.listdir results are now gathered just once per subdirectory and passed along, to avoid extra calls in dirdiff—not a huge win, but every cycle counts on the pitifully underpowered netbook I used when writing this edition.

Running the Script

Since we’ve already studied the tree-walking tools this script employs, let’s jump right into a few example runs. When run on identical trees, status messages scroll during the traversal, and a No diffs found. message appears at the end:

C:\...\PP4E\System\Filetools> diffall.py C:\temp\PP3E\Examples copytemp > diffs.txt
C:\...\PP4E\System\Filetools> type diffs.txt | more
--------------------
Comparing C:\temp\PP3E\Examples to copytemp
Directory lists are identical
Comparing contents
README-root.txt matches
--------------------
Comparing C:\temp\PP3E\Examples\PP3E to copytemp\PP3E
Directory lists are identical
Comparing contents
echoEnvironment.pyw matches
LaunchBrowser.pyw matches
Launcher.py matches
Launcher.pyc matches
...over 2,000 more lines omitted...
--------------------
Comparing C:\temp\PP3E\Examples\PP3E\TempParts to copytemp\PP3E\TempParts
Directory lists are identical
Comparing contents
109_0237.JPG matches
lawnlake1-jan-03.jpg matches
part-001.txt matches
part-002.html matches
========================================
No diffs found.

I usually run this with the verbose flag passed in as True, and redirect output to a file (for big trees, it produces too much output to scroll through comfortably); use False to watch fewer status messages fly by. To show how differences are reported, we need to generate a few; for simplicity, I’ll manually change a few files scattered about one of the trees, but you could also run a global search-and-replace script like the one we’ll write later in this chapter. While we’re at it, let’s remove a few common files so that directory uniqueness differences show up on the scope, too; the last two removal commands in the following will generate one difference in the same directory in different trees:

C:\...\PP4E\System\Filetools> notepad copytemp\PP3E\README-PP3E.txt
C:\...\PP4E\System\Filetools> notepad copytemp\PP3E\System\Filetools\commands.py
C:\...\PP4E\System\Filetools> notepad C:\temp\PP3E\Examples\PP3E\__init__.py

C:\...\PP4E\System\Filetools> del copytemp\PP3E\System\Filetools\cpall_visitor.py
C:\...\PP4E\System\Filetools> del copytemp\PP3E\Launcher.py
C:\...\PP4E\System\Filetools> del C:\temp\PP3E\Examples\PP3E\PyGadgets.py

Now, rerun the comparison walker to pick out differences and redirect its output report to a file for easy inspection. The following lists just the parts of the output report that identify differences. In typical use, I inspect the summary at the bottom of the report first, and then search for the strings "DIFF" and "unique" in the report’s text if I need more information about the differences summarized; this interface could be much more user-friendly, of course, but it does the job for me:

C:\...\PP4E\System\Filetools> diffall.py C:\temp\PP3E\Examples copytemp > diff2.txt
C:\...\PP4E\System\Filetools> notepad diff2.txt
--------------------
Comparing C:\temp\PP3E\Examples to copytemp
Directory lists are identical
Comparing contents
README-root.txt matches
--------------------
Comparing C:\temp\PP3E\Examples\PP3E to copytemp\PP3E
Files unique to C:\temp\PP3E\Examples\PP3E
... Launcher.py
Files unique to copytemp\PP3E
... PyGadgets.py
Comparing contents
echoEnvironment.pyw matches
LaunchBrowser.pyw matches
Launcher.pyc matches
...more omitted...
PyGadgets_bar.pyw matches
README-PP3E.txt DIFFERS
todos.py matches
tounix.py matches
__init__.py DIFFERS
__init__.pyc matches
--------------------
Comparing C:\temp\PP3E\Examples\PP3E\System\Filetools to copytemp\PP3E\System\Fil...
Files unique to C:\temp\PP3E\Examples\PP3E\System\Filetools
... cpall_visitor.py
Comparing contents
commands.py DIFFERS
cpall.py matches
...more omitted...
--------------------
Comparing C:\temp\PP3E\Examples\PP3E\TempParts to copytemp\PP3E\TempParts
Directory lists are identical
Comparing contents
109_0237.JPG matches
lawnlake1-jan-03.jpg matches
part-001.txt matches
part-002.html matches
========================================
Diffs found: 5
- unique files at C:\temp\PP3E\Examples\PP3E - copytemp\PP3E
- files differ at C:\temp\PP3E\Examples\PP3E\README-PP3E.txt -
         copytemp\PP3E\README-PP3E.txt
- files differ at C:\temp\PP3E\Examples\PP3E\__init__.py -
         copytemp\PP3E\__init__.py
- unique files at C:\temp\PP3E\Examples\PP3E\System\Filetools -
         copytemp\PP3E\System\Filetools
- files differ at C:\temp\PP3E\Examples\PP3E\System\Filetools\commands.py -
         copytemp\PP3E\System\Filetools\commands.py

I added line breaks and tabs in a few of these output lines to make them fit on this page, but the report is simple to understand. In a tree with 1,430 files and 185 directories, we found five differences—the three files we changed by edits, and the two directories we threw out of sync with the three removal commands.

Verifying Backups

So how does this script placate CD backup paranoia? To double-check my CD writer’s work, I run a command such as the following. I can also use a command like this to find out what has been changed since the last backup. Again, since the CD is “G:” on my machine when plugged in, I provide a path rooted there; use a root such as /dev/cdrom or /mnt/cdrom on Linux:

C:\...\PP4E\System\Filetools> python diffall.py Examples g:\PP3E\Examples > diff0226
C:\...\PP4E\System\Filetools> more diff0226
...output omitted...

The CD spins, the script compares, and a summary of differences appears at the end of the report. For an example of a full difference report, see the diff*.txt files in the book’s examples distribution package. And to be really sure, I run the following global comparison command to verify the entire book development tree backed up to a memory stick (which works just like a CD in terms of the filesystem):

C:\...\PP4E\System\Filetools> diffall.py F:\writing-backups\feb-26-10\dev
                                 C:\Users\mark\Stuff\Books\4E\PP4E\dev > diff3.txt
C:\...\PP4E\System\Filetools> more diff3.txt
--------------------
Comparing F:\writing-backups\feb-26-10\dev to C:\Users\mark\Stuff\Books\4E\PP4E\dev
Directory lists are identical
Comparing contents
ch00.doc DIFFERS
ch01.doc matches
ch02.doc DIFFERS
ch03.doc matches
ch04.doc DIFFERS
ch05.doc matches
ch06.doc DIFFERS
...more output omitted...
--------------------
Comparing F:\writing-backups\feb-26-10\dev\Examples\PP4E\System\Filetools to C:…
Files unique to C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Filetools
... copytemp
... cpall.py
... diff2.txt
... diff3.txt
... diffall.py
... diffs.txt
... dirdiff.py
... dirdiff.pyc
Comparing contents
bigext-tree.py matches
bigpy-dir.py matches
...more output omitted...
========================================
Diffs found: 7
- files differ at F:\writing-backups\feb-26-10\dev\ch00.doc -
         C:\Users\mark\Stuff\Books\4E\PP4E\dev\ch00.doc
- files differ at F:\writing-backups\feb-26-10\dev\ch02.doc -
         C:\Users\mark\Stuff\Books\4E\PP4E\dev\ch02.doc
- files differ at F:\writing-backups\feb-26-10\dev\ch04.doc -
         C:\Users\mark\Stuff\Books\4E\PP4E\dev\ch04.doc
- files differ at F:\writing-backups\feb-26-10\dev\ch06.doc -
         C:\Users\mark\Stuff\Books\4E\PP4E\dev\ch06.doc
- files differ at F:\writing-backups\feb-26-10\dev\TOC.txt -
         C:\Users\mark\Stuff\Books\4E\PP4E\dev\TOC.txt
- unique files at F:\writing-backups\feb-26-10\dev\Examples\PP4E\System\Filetools -
         C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Filetools
- files differ at F:\writing-backups\feb-26-10\dev\Examples\PP4E\Tools\visitor.py -
         C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\Tools\visitor.py

This particular run indicates that I’ve added a few examples and changed some chapter files since the last backup; if run immediately after a backup, nothing should show up on diffall radar except for any files that cannot be copied in general. This global comparison can take a few minutes. It performs byte-for-byte comparisons of all chapter files and screenshots, the examples tree, and more, but it’s an accurate and complete verification. Given that this book development tree contained many files, a more manual verification procedure without Python’s help would be utterly impossible.

After writing this script, I also started using it to verify full automated backups of my laptops onto an external hard-drive device. To do so, I run the cpall copy script we wrote earlier in the preceding section of this chapter, and then the comparison script developed here to check results and get a list of files that didn’t copy correctly. The last time I did this, this procedure copied and compared 225,000 files and 15,000 directories in 20 GB of space—not the sort of task that lends itself to manual labor!

Here are the magic incantations on my Windows laptop. f: is a partition on my external hard drive, and you shouldn’t be surprised if each of these commands runs for half an hour or more on currently common hardware. A drag-and-drop copy takes at least as long (assuming it works at all!):

C:\...\System\Filetools> cpall.py c:\ f:\   > f:\copy-log.txt
C:\...\System\Filetools> diffall.py f:\ c:\ > f:\diff-log.txt

Reporting Differences and Other Ideas

Finally, it’s worth noting that this script still only detects differences in the tree but does not give any further details about individual file differences. In fact, it simply loads and compares the binary contents of corresponding files with string comparisons. It’s a simple yes/no result.

If and when I need more details about how two reported files actually differ, I either edit the files or run the file-comparison command on the host platform (e.g., fc on Windows/DOS, diff or cmp on Unix and Linux). That’s not a portable solution for this last step; but for my purposes, just finding the differences in a 1,400-file tree was much more critical than reporting which lines differ in files flagged in the report.

Of course, since we can always run shell commands in Python, this last step could be automated by spawning a diff or fc command with os.popen as differences are encountered (or after the traversal, by scanning the report summary). The output of these system calls could be displayed verbatim, or parsed for relevant parts.

We also might try to do a bit better here by opening true text files in text mode to ignore line-terminator differences caused by transferring across platforms, but it’s not clear that such differences should be ignored (what if the caller wants to know whether line-end markers have been changed?). For example, after downloading a website with an FTP script we’ll meet in Chapter 13, the diffall script detected a discrepancy between the local copy of a file and the one at the remote server. To probe further, I simply ran some interactive Python code:

>>> a = open('lp2e-updates.html', 'rb').read()
>>> b = open(r'C:\Mark\WEBSITE\public_html\lp2e-updates.html', 'rb').read()
>>> a == b
False

This verifies that there really is a binary difference in the downloaded and local versions of the file; to see whether it’s because a Unix or DOS line end snuck into the file, try again in text mode so that line ends are all mapped to the standard character:

>>> a = open('lp2e-updates.html', 'r').read()
>>> b = open(r'C:\Mark\WEBSITE\public_html\lp2e-updates.html', 'r').read()
>>> a == b
True

Sure enough; now, to find where the difference is, the following code checks character by character until the first mismatch is found (in binary mode, so we retain the difference):

>>> a = open('lp2e-updates.html', 'rb').read()
>>> b = open(r'C:\Mark\WEBSITE\public_html\lp2e-updates.html', 'rb').read()

>>> for (i, (ac, bc)) in enumerate(zip(a, b)):
...     if ac != bc:
...         print(i, repr(ac), repr(bc))
...         break
...
37966 '\r' '\n'

This means that at byte offset 37,966, there is a \r in the downloaded file, but a \n in the local copy. This line has a DOS line end in one and a Unix line end in the other. To see more, print text around the mismatch:

>>> for (i, (ac, bc)) in enumerate(zip(a, b)):
...     if ac != bc:
...         print(i, repr(ac), repr(bc))
...         print(repr(a[i-20:i+20]))
...         print(repr(b[i-20:i+20]))
...         break
...
37966 '\r' '\n'
're>\r\ndef min(*args):\r\n    tmp = list(arg'
're>\r\ndef min(*args):\n    tmp = list(args'

Apparently, I wound up with a Unix line end at one point in the local copy and a DOS line end in the version I downloaded: the combined effect of the text mode used by the download script itself (which translated \n to \r\n) and years of edits on both Linux and Windows PDAs and laptops (I probably coded this change on Linux and copied it to my local Windows copy in binary mode). Code such as this could be integrated into the diffall script to make it more intelligent about text files and difference reporting.
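If this probe were folded into a reusable tool, one detail is worth keeping in mind: in Python 3.X, iterating over a bytes object yields small integers rather than one-character strings, so slicing is the simpler way to display the bytes around a mismatch. The following is a minimal sketch under that assumption; first_mismatch, its context parameter, and its return shape are hypothetical names chosen for illustration.

```python
def first_mismatch(a, b, context=20):
    """
    Return (offset, slice_of_a, slice_of_b) around the first differing
    byte of two bytes objects, or None if they are identical
    (hypothetical helper; names and return shape are illustrative)
    """
    for i, (ac, bc) in enumerate(zip(a, b)):
        if ac != bc:
            start = max(i - context, 0)        # avoid negative slice starts
            return i, a[start:i + context], b[start:i + context]
    if len(a) != len(b):                       # one is a prefix of the other
        i = min(len(a), len(b))
        start = max(i - context, 0)
        return i, a[start:i + context], b[start:i + context]
    return None

dos  = b'def min(*args):\r\n    tmp = list(args)'
unix = b'def min(*args):\n    tmp = list(args)'
print(first_mismatch(dos, unix, context=5))
```

Clamping the slice start at zero matters when the mismatch falls within the first few bytes, where a negative start index would otherwise select from the end of the data.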

Because Python excels at processing files and strings, it’s even possible to go one step further and code a Python equivalent of the fc and diff commands. In fact, much of the work has already been done; the standard library module difflib could make this task simple. See the Python library manual for details and usage examples.
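As a rough idea of what difflib offers, the sketch below builds a unified-style report from two lists of lines. It is a minimal illustration only: report_line_diffs and its name1 and name2 parameters are hypothetical, and a real tool would read the lines from the two files flagged by diffall.

```python
import difflib

def report_line_diffs(lines1, lines2, name1='file1', name2='file2'):
    """
    Return a list of unified-diff report lines for two lists of text lines
    (hypothetical wrapper around the standard difflib.unified_diff)
    """
    return list(difflib.unified_diff(lines1, lines2,
                                     fromfile=name1, tofile=name2,
                                     lineterm=''))

old = ['import os', 'print(os.getcwd())']
new = ['import os, sys', 'print(os.getcwd())']
for line in report_line_diffs(old, new):
    print(line)
```

Identical inputs produce an empty report, so the result can also serve as a yes/no test.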

We could also be smarter by avoiding the load and compare steps for files that differ in size, and we might use a smaller block size to reduce the script’s memory requirements. For most trees, such optimizations are unnecessary; reading multimegabyte files into strings is very fast in Python, and garbage collection reclaims the space as you go.
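The size test mentioned here might look like the following. This is a minimal sketch, and sizes_differ is a hypothetical name rather than anything in diffall.py.

```python
import os

def sizes_differ(path1, path2):
    """
    True if two files cannot be identical because their byte sizes differ;
    a cheap test to run before reading any content (hypothetical helper)
    """
    return os.path.getsize(path1) != os.path.getsize(path2)
```

A caller such as comparetrees could run this before opening either file and record a difference immediately when it returns True; files of equal size would still need the byte-for-byte comparison.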

Since such extensions are beyond both this script’s scope and this chapter’s size limits, though, they will have to await the attention of a curious reader (this book doesn’t have formal exercises, but that almost sounds like one, doesn’t it?). For now, let’s move on to explore ways to code one more common directory task: search.

Searching Directory Trees

Engineers love to change things. As I was writing this book, I found it almost irresistible to move and rename directories, variables, and shared modules in the book examples tree whenever I thought I’d stumbled onto a more coherent structure. That was fine early on, but as the tree became more intertwined, this became a maintenance nightmare. Things such as program directory paths and module names were hardcoded all over the place—in package import statements, program startup calls, text notes, configuration files, and more.

One way to repair these references, of course, is to edit every file in the directory by hand, searching each for information that has changed. That’s so tedious as to be utterly impossible in this book’s examples tree, though; the examples of the prior edition contained 186 directories and 1,429 files! Clearly, I needed a way to automate updates after changes. There are a variety of solutions to such goals—from shell commands, to find operations, to custom tree walkers, to general-purpose frameworks. In this and the next section, we’ll explore each option in turn, just as I did while refining solutions to this real-world dilemma.

Greps and Globs and Finds

If you work on Unix-like systems, you probably already know that there is a standard way to search files for strings on such platforms—the command-line program grep and its relatives list all lines in one or more files containing a string or string pattern.^[22] Given that shells expand (i.e., “glob”) filename patterns automatically, a command such as the following will search a single directory’s Python files for a string named on the command line (this uses the grep command installed with the Cygwin Unix-like system for Windows that I described in the prior chapter):

C:\...\PP4E\System\Filetools> c:\cygwin\bin\grep.exe walk *.py
bigext-tree.py:for (thisDir, subsHere, filesHere) in os.walk(dirname):
bigpy-path.py:    for (thisDir, subsHere, filesHere) in os.walk(srcdir):
bigpy-tree.py:for (thisDir, subsHere, filesHere) in os.walk(dirname):

As we’ve seen, we can often accomplish the same within a Python script by running such a shell command with os.system or os.popen. And if we search its results manually, we can also achieve similar results with the Python glob module we met in Chapter 4; it expands a filename pattern into a list of matching filename strings much like a shell:

C:...PP4ESystemFiletools> python
>>> import os
>>> for line in os.popen(r'c:\cygwin\bin\grep.exe walk *.py'):
...     print(line, end='')
...
bigext-tree.py:for (thisDir, subsHere, filesHere) in os.walk(dirname):
bigpy-path.py:    for (thisDir, subsHere, filesHere) in os.walk(srcdir):
bigpy-tree.py:for (thisDir, subsHere, filesHere) in os.walk(dirname):

>>> from glob import glob
>>> for filename in glob('*.py'):
...     if 'walk' in open(filename).read():
...         print(filename)
...
bigext-tree.py
bigpy-path.py
bigpy-tree.py

Unfortunately, these tools are generally limited to a single directory. glob can visit multiple directories given the right sort of pattern string, but it’s not a general directory walker of the sort I need to maintain a large examples tree. On Unix-like systems, a find shell command can go the extra mile to traverse an entire directory tree. For instance, the following Unix command line would pinpoint lines and files at and below the current directory that mention the string popen:

find . -name "*.py" -print -exec fgrep popen {} \;

If you happen to have a Unix-like find command on every machine you will ever use, this is one way to process directories.

Rolling Your Own find Module

But if you don’t happen to have a Unix find on all your computers, not to worry—it’s easy to code a portable one in Python. Python itself used to have a find module in its standard library, which I used frequently in the past. Although that module was removed between the second and third editions of this book, the newer os.walk makes writing your own simple. Rather than lamenting the demise of a module, I decided to spend 10 minutes coding a custom equivalent.

Example 6-13 implements a find utility in Python, which collects all matching filenames in a directory tree. Unlike glob.glob, its find.find automatically matches through an entire tree. And unlike the tree walk structure of os.walk, we can treat find.find results as a simple linear group.

Example 6-13. PP4E\Tools\find.py

#!/usr/bin/python
"""
################################################################################
Return all files matching a filename pattern at and below a root directory;

custom version of the now deprecated find module in the standard library:
import as "PP4E.Tools.find"; like original, but uses os.walk loop, has no
support for pruning subdirs, and is runnable as a top-level script;

find() is a generator that uses the os.walk() generator to yield just
matching filenames: use findlist() to force results list generation;
################################################################################
"""

import fnmatch, os

def find(pattern, startdir=os.curdir):
    for (thisDir, subsHere, filesHere) in os.walk(startdir):
        for name in subsHere + filesHere:
            if fnmatch.fnmatch(name, pattern):
                fullpath = os.path.join(thisDir, name)
                yield fullpath

def findlist(pattern, startdir=os.curdir, dosort=False):
    matches = list(find(pattern, startdir))
    if dosort: matches.sort()
    return matches

if __name__ == '__main__':
    import sys
    namepattern, startdir = sys.argv[1], sys.argv[2]
    for name in find(namepattern, startdir): print(name)

There’s not much to this file (it’s largely just a minor extension to os.walk), but calling its find function provides the same utility as both the deprecated find standard library module and the Unix utility of the same name. It’s also much more portable, and noticeably easier than repeating all of this file’s code every time you need to perform a find-type search. Because this file is instrumented to be both a script and a library, it can be either run as a command-line tool or called from other programs.

For instance, to process every Python file in the directory tree rooted one level up from the current working directory, I simply run the following command line from a system console window. Run this yourself to watch its progress; the script’s standard output is piped into the more command to page it here, but it can be piped into any processing program that reads its input from the standard input stream (remember to quote the “*.py” on Unix and Linux shells only, to avoid premature pattern expansion):

C:\...\PP4E\Tools> python find.py *.py .. | more
..\LaunchBrowser.py
..\Launcher.py
..\__init__.py
..\Preview\attachgui.py
..\Preview\customizegui.py
...more lines omitted...

For more control, run the following sort of Python code from a script or interactive prompt. In this mode, you can apply any operation to the found files that the Python language provides:

C:\...\PP4E\System\Filetools> python
>>> from PP4E.Tools import find                  # or just import find if in cwd
>>> for filename in find.find('*.py', '..'):
...     if 'walk' in open(filename).read():
...         print(filename)
...
..\Launcher.py
..\System\Filetools\bigext-tree.py
..\System\Filetools\bigpy-path.py
..\System\Filetools\bigpy-tree.py
..\Tools\cleanpyc.py
..\Tools\find.py
..\Tools\visitor.py

Notice how this avoids having to recode the nested loop structure required for os.walk every time you want a list of matching file names; for many use cases, this seems conceptually simpler. Also note that because this finder is a generator function, your script doesn’t have to wait until all matching files have been found and collected; os.walk yields results as it goes, and find.find yields matching files among that set.
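To make the generator point concrete, the sketch below stops a search after a fixed number of matches, so the rest of the tree is never walked. It repeats the find logic of Example 6-13 only so that it runs standalone; first_matches and its limit parameter are hypothetical names added for illustration.

```python
import fnmatch, os
from itertools import islice

def find(pattern, startdir=os.curdir):
    # same logic as Example 6-13, repeated so this sketch is self-contained
    for (thisDir, subsHere, filesHere) in os.walk(startdir):
        for name in subsHere + filesHere:
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(thisDir, name)

def first_matches(pattern, startdir=os.curdir, limit=3):
    """
    Return at most limit matches; the walk is abandoned once enough are
    found (first_matches and limit are illustrative, not part of find.py)
    """
    return list(islice(find(pattern, startdir), limit))
```

A call such as first_matches('*.py', '..') returns as soon as three names have been produced, which a list-building finder could not do without first traversing the entire tree.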

Here’s a more complex example of our find module at work: the following system command line lists all Python files in directory C:\temp\PP3E whose names begin with the letter q or x. Note how find returns full directory paths that begin with the start directory specification:

C:\...\PP4E\Tools> find.py [qx]*.py C:\temp\PP3E
C:\temp\PP3E\Examples\PP3E\Database\SQLscripts\querydb.py
C:\temp\PP3E\Examples\PP3E\Gui\Tools\queuetest-gui-class.py
C:\temp\PP3E\Examples\PP3E\Gui\Tools\queuetest-gui.py
C:\temp\PP3E\Examples\PP3E\Gui\Tour\quitter.py
C:\temp\PP3E\Examples\PP3E\Internet\Other\Grail\Question.py
C:\temp\PP3E\Examples\PP3E\Internet\Other\XML\xmlrpc.py
C:\temp\PP3E\Examples\PP3E\System\Threads\queuetest.py

And here’s some Python code that does the same find but also extracts base names and file sizes for each file found:

C:\...\PP4E\Tools> python
>>> import os
>>> from find import find
>>> for name in find('[qx]*.py', r'C:\temp\PP3E'):
...     print(os.path.basename(name), os.path.getsize(name))
...
querydb.py 635
queuetest-gui-class.py 1152
queuetest-gui.py 963
quitter.py 801
Question.py 817
xmlrpc.py 705
queuetest.py 1273

The fnmatch module

To achieve such code economy, the find module calls os.walk to walk the tree and simply yields matching filenames along the way. New here, though, is the fnmatch module—yet another Python standard library module that performs Unix-like pattern matching against filenames. This module supports common operators in name pattern strings: * to match any number of characters, ? to match any single character, and [...] and [!...] to match any character inside the bracket pairs or not; other characters match themselves. Unlike the re module, fnmatch supports only common Unix shell matching operators, not full-blown regular expression patterns; we’ll see why this distinction matters in Chapter 19.
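A few quick calls show these operators in action. This is a minimal sketch on a made-up list of names; it uses fnmatch.fnmatchcase for the single-name tests so that results do not depend on the platform’s filename case rules.

```python
import fnmatch

# sample names invented for this illustration
names = ['querydb.py', 'xmlrpc.py', 'find.py', 'notes.txt', 'q1.pyc']

# [...] matches one character from the set, * matches any run of characters
print(fnmatch.filter(names, '[qx]*.py'))

# ? matches exactly one character
print(fnmatch.fnmatchcase('find.py', '????.py'))

# [!...] matches one character not in the set
print(fnmatch.fnmatchcase('find.py', '[!qx]*.py'))
```

Note that a pattern must match the whole name, so 'q1.pyc' is not selected by a pattern ending in '.py'.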

Interestingly, Python’s glob.glob function also uses the fnmatch module to match names: it combines os.listdir and fnmatch to match in directories in much the same way our find.find combines os.walk and fnmatch to match in trees (though os.walk ultimately uses os.listdir as well). One ramification of all this is that you can pass byte strings for both pattern and start-directory to find.find if you need to suppress Unicode filename decoding, just as you can for os.walk and glob.glob; you’ll receive byte strings for filenames in the result. See Chapter 4 for more details on Unicode filenames.

By comparison, find.find with just “*” for its name pattern is also roughly equivalent to platform-specific directory tree listing shell commands such as dir /B /S on DOS and Windows. Since all files match “*”, this just exhaustively generates all the file names in a tree with a single traversal. Because we can usually run such shell commands in a Python script with os.popen, the following do the same work, but the first is inherently nonportable and must start up a separate program along the way:

>>> import os
>>> for line in os.popen('dir /B /S'): print(line, end='')

>>> from PP4E.Tools.find import find
>>> for name in find(pattern='*', startdir='.'): print(name)

Watch for this utility to show up in action later in this chapter and book, including an arguably strong showing in the next section and a cameo appearance in the Grep dialog of Chapter 11’s PyEdit text editor GUI, where it will serve a central role in a threaded external files search tool. The standard library’s find module may be gone, but it need not be forgotten.

Note

In fact, you must pass a bytes pattern string for a bytes filename to fnmatch (or pass both as str), because the re pattern matching module it uses does not allow the string types of subject and pattern to be mixed. This rule is inherited by our find.find for directory and pattern. See Chapter 19 for more on re.

Curiously, the fnmatch module in Python 3.1 also converts a bytes pattern string to and from Unicode str in order to perform internal text processing, using the Latin-1 encoding. This suffices for many contexts, but may not be entirely sound for some encodings which do not map to Latin-1 cleanly. sys.getfilesystemencoding might be a better encoding choice in such contexts, as this reflects the underlying file system’s constraints (as we learned in Chapter 4, sys.getdefaultencoding reflects file content, not names).

In the absence of bytes, os.walk assumes filenames follow the platform’s convention and does not ignore decoding errors triggered by os.listdir. In the “grep” utility of Chapter 11’s PyEdit, this picture is further clouded by the fact that a str pattern string from a GUI would have to be encoded to bytes using a potentially inappropriate encoding for some files present. See fnmatch.py and os.py in Python’s library and the Python library manual for more details. Unicode can be a very subtle affair.
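The string-type rule in this note can be checked with a short experiment. The following is a minimal sketch; safe_match is a hypothetical wrapper, and it relies on the mixed-type case surfacing as a TypeError from the underlying re module.

```python
from fnmatch import fnmatchcase

def safe_match(name, pattern):
    """
    Return the match result when name and pattern are both str or both
    bytes, or None when mixing the two raises TypeError
    (hypothetical helper, for illustration only)
    """
    try:
        return fnmatchcase(name, pattern)
    except TypeError:
        return None

print(safe_match('find.py', '*.py'))        # str name, str pattern
print(safe_match(b'find.py', b'*.py'))      # bytes name, bytes pattern
print(safe_match(b'find.py', '*.py'))       # mixed types
```

A tool that accepts patterns from users and names from os.walk could use a check like this to report a clear error instead of failing partway through a traversal.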

Cleaning Up Bytecode Files

The find module of the prior section isn’t quite the general string searcher we’re after, but it’s an important first step—it collects files that we can then search in an automated script. In fact, the act of collecting matching files in a tree is enough by itself to support a wide variety of day-to-day system tasks.

For example, one of the other common tasks I perform on a regular basis is removing all the bytecode files in a tree. Because these are not always portable across major Python releases, it’s usually a good idea to ship programs without them and let Python create new ones on first imports. Now that we’re expert os.walk users, we could cut out the middleman and use it directly. Example 6-14 codes a portable and general command-line tool, with support for arguments, exception processing, tracing, and list-only mode.

Example 6-14. PP4E\Tools\cleanpyc.py

"""
delete all .pyc bytecode files in a directory tree: use the
command line arg as root if given, else current working dir
"""

import os, sys
findonly = False
rootdir = os.getcwd() if len(sys.argv) == 1 else sys.argv[1]

found = removed = 0
for (thisDirLevel, subsHere, filesHere) in os.walk(rootdir):
    for filename in filesHere:
        if filename.endswith('.pyc'):
            fullname = os.path.join(thisDirLevel, filename)
            print('=>', fullname)
            if not findonly:
                try:
                    os.remove(fullname)
                    removed += 1
                except Exception:
                    exctype, inst = sys.exc_info()[:2]      # avoid shadowing built-in type
                    print('*'*4, 'Failed:', filename, exctype, inst)
            found += 1

print('Found', found, 'files, removed', removed)

When run, this script walks a directory tree (the CWD by default, or else one passed in on the command line), deleting any and all bytecode files along the way:

C:\...\Examples\PP4E> Tools\cleanpyc.py
=> C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\__init__.pyc
=> C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\Preview\initdata.pyc
=> C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\Preview\make_db_file.pyc
=> C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\Preview\manager.pyc
=> C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\Preview\person.pyc
...more lines here...
Found 24 files, removed 24

C:\...\PP4E\Tools> cleanpyc.py .
=> .\find.pyc
=> .\visitor.pyc
=> .\__init__.pyc
Found 3 files, removed 3

This script works, but it’s a bit more manual and code-y than it needs to be. In fact, now that we also know about find operations, writing scripts based upon them is almost trivial when we just need to match filenames. Example 6-15, for instance, falls back on spawning shell find commands if you have them.

Example 6-15. PP4E\Tools\cleanpyc-find-shell.py

"""
find and delete all "*.pyc" bytecode files at and below the directory
named on the command-line; assumes a nonportable Unix-like find command
"""

import os, sys

rundir = sys.argv[1]
if sys.platform[:3] == 'win':
    findcmd = r'c:\cygwin\bin\find %s -name "*.pyc" -print' % rundir
else:
    findcmd = 'find %s -name "*.pyc" -print' % rundir
print(findcmd)

count = 0
for fileline in os.popen(findcmd):                  # for all result lines
    count += 1                                      # have \n at the end
    print(fileline, end='')
    os.remove(fileline.rstrip())

print('Removed %d .pyc files' % count)

When run, files returned by the shell command are removed:

C:\...\PP4E\Tools> cleanpyc-find-shell.py .
c:\cygwin\bin\find . -name "*.pyc" -print
./find.pyc
./visitor.pyc
./__init__.pyc
Removed 3 .pyc files

This script uses os.popen to collect the output of a Cygwin find program installed on one of my Windows computers, or else the standard find tool on the Linux side. It’s also completely nonportable to Windows machines that don’t have the Unix-like find program installed, and that includes other computers of my own (not to mention those throughout most of the world at large). As we’ve seen, spawning shell commands also incurs performance penalties for starting a new program.

We can do much better on the portability and performance fronts and still retain code simplicity, by applying the find tool we wrote in Python in the prior section. The new script is shown in Example 6-16.

Example 6-16. PP4E\Tools\cleanpyc-find-py.py

"""
find and delete all "*.pyc" bytecode files at and below the directory
named on the command-line; this uses a Python-coded find utility, and
so is portable; run this to delete .pyc's from an old Python release;
"""

import os, sys, find   # here, gets Tools.find

count = 0
for filename in find.find('*.pyc', sys.argv[1]):
    count += 1
    print(filename)
    os.remove(filename)

print('Removed %d .pyc files' % count)

When run, all bytecode files in the tree rooted at the passed-in directory name are removed as before; this time, though, our script works just about everywhere Python does:

C:\...\PP4E\Tools> cleanpyc-find-py.py .
.\find.pyc
.\visitor.pyc
.\__init__.pyc
Removed 3 .pyc files

This works portably, and it avoids external program startup costs. But find is really just half the story—it collects files matching a name pattern but doesn’t search their content. Although extra code can add such searching to a find’s result, a more manual approach can allow us to tap into the search process more directly. The next section shows how.
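
As a rough sketch of the "extra code" route just mentioned, the following filters a find-style result by file content. The find generator here is a simplified stand-in written only for this illustration (an assumption, not the prior section's listing), and grep_found is a hypothetical helper name; ignoring decoding errors is likewise a choice made for brevity.

```python
import fnmatch, os

def find(pattern, startdir=os.curdir):
    # simplified stand-in for the prior section's find tool (an assumption)
    for (thisDir, subsHere, filesHere) in os.walk(startdir):
        for name in filesHere:
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(thisDir, name)

def grep_found(pattern, startdir, searchkey):
    # hypothetical helper: search the content of each file that find matches
    matches = []
    for fpath in find(pattern, startdir):
        try:
            with open(fpath, errors='ignore') as f:    # skip undecodable bytes
                if searchkey in f.read():
                    matches.append(fpath)
        except OSError:
            pass                                       # skip unreadable files
    return matches
```

This collects names first and opens files second, so the tree is still walked only once, but every matched name is visited again to read it.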

A Python Tree Searcher

After experimenting with greps and globs and finds, in the end, to help ease the task of performing global searches on all platforms I might ever use, I wound up coding a task-specific Python script to do most of the work for me. Example 6-17 employs the following standard Python tools that we met in the preceding chapters: os.walk to visit files in a directory, os.path.splitext to skip over files with binary-type extensions, and os.path.join to portably combine a directory path and filename.

Because it’s pure Python code, it can be run the same way on both Linux and Windows. In fact, it should work on any computer where Python has been installed. Moreover, because it uses direct system calls, it will likely be faster than approaches that rely on underlying shell commands.

Example 6-17. PP4E\Tools\search_all.py

"""
################################################################################
Use: "python ...\Tools\search_all.py dir string".
Search all files at and below a named directory for a string; uses the
os.walk interface, rather than doing a find.find to collect names first;
similar to calling visitfile for each find.find result for "*" pattern;
################################################################################
"""

import os, sys
listonly = False
textexts = ['.py', '.pyw', '.txt', '.c', '.h']             # ignore binary files

def searcher(startdir, searchkey):
    global fcount, vcount
    fcount = vcount = 0
    for (thisDir, dirsHere, filesHere) in os.walk(startdir):
        for fname in filesHere:                            # do non-dir files here
            fpath = os.path.join(thisDir, fname)           # fnames have no dirpath
            visitfile(fpath, searchkey)

def visitfile(fpath, searchkey):                           # for each non-dir file
    global fcount, vcount                                  # search for string
    print(vcount+1, '=>', fpath)                           # skip protected files
    try:
        if not listonly:
            if os.path.splitext(fpath)[1] not in textexts:
                print('Skipping', fpath)
            elif searchkey in open(fpath).read():
                input('%s has %s' % (fpath, searchkey))
                fcount += 1
    except:
        print('Failed:', fpath, sys.exc_info()[0])
    vcount += 1

if __name__ == '__main__':
    searcher(sys.argv[1], sys.argv[2])
    print('Found in %d files, visited %d' % (fcount, vcount))

Operationally, this script works roughly the same as calling its visitfile function for every result generated by our find.find tool with a pattern of “*”; but because this version is specific to searching content it can be better tailored for its goal. Really, this equivalence holds only because a “*” pattern invokes an exhaustive traversal in find.find, and that’s all that this new script’s searcher function does. The finder is good at selecting specific file types, but this script benefits from a more custom single traversal.

When run standalone, the search key is passed on the command line; when imported, clients call this module’s searcher function directly. For example, to search (that is, grep) for all appearances of a string in the book examples tree, I run a command line like this in a DOS or Unix shell:

C:\PP4E> Tools\search_all.py . mimetypes
1 => .\LaunchBrowser.py
2 => .\Launcher.py
3 => .\Launch_PyDemos.pyw
4 => .\Launch_PyGadgets_bar.pyw
5 => .\__init__.py
6 => .\__init__.pyc
Skipping .\__init__.pyc
7 => .\Preview\attachgui.py
8 => .\Preview\bob.pkl
Skipping .\Preview\bob.pkl
...more lines omitted: pauses for Enter key press at matches...
Found in 2 files, visited 184

The script lists each file it checks as it goes, tells you which files it is skipping (names that end in extensions not listed in the variable textexts that imply binary data), and pauses for an Enter key press each time it announces a file containing the search string. The search_all script works the same way when it is imported rather than run, but there is no final statistics output line (fcount and vcount live in the module and so would have to be imported to be inspected here):

C:\...\PP4E\dev\Examples\PP4E> python
>>> from Tools import search_all
>>> search_all.searcher(r'C:\temp\PP3E\Examples', 'mimetypes')
...more lines omitted: 8 pauses for Enter key press along the way...
>>> search_all.fcount, search_all.vcount     # matches, files
(8, 1429)

However launched, this script tracks down all references to a string in an entire directory tree: a name of a changed book examples file, object, or directory, for instance. It’s exactly what I was looking for—or at least I thought so, until further deliberation drove me to seek more complete and better structured solutions, the topic of the next section.

Note

Be sure to also see the coverage of regular expressions in Chapter 19. The search_all script here searches for a simple string in each file with the in string membership expression, but it would be trivial to extend it to search for a regular expression pattern match instead (roughly, just replace in with a call to a regular expression object’s search method). Of course, such a mutation will be much more trivial after we’ve learned how.

Also notice the textexts list in Example 6-17, which attempts to list all possible text file types: it would be more general and robust to use the mimetypes logic we will meet near the end of this chapter in order to guess file content type from its name, but the explicit extensions list provides more control and sufficed for the trees I used this script against.

Finally note that for simplicity many of the directory searches in this chapter assume that text is encoded per the underlying platform’s Unicode default. They could open text in binary mode to avoid decoding errors, but searches might then be inaccurate because of encoding scheme differences in the raw encoded bytes. To see how to do better, watch for the “grep” utility in Chapter 11’s PyEdit GUI, which will apply an encoding name to all the files in a searched tree and ignore those text or binary files that fail to decode.
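
To make the regular expression extension mentioned in this note a bit more concrete, here is a minimal sketch of a match test that could stand in for the in membership expression. The function name file_matches is hypothetical, and opening with errors='ignore' is an assumption made here to sidestep decoding failures; it is not what search_all itself does.

```python
import re

def file_matches(fpath, pattern):
    # hypothetical variant of the "searchkey in text" test: pattern is a
    # regular expression string rather than a simple substring
    regex = re.compile(pattern)
    with open(fpath, errors='ignore') as f:        # ignore undecodable bytes
        return regex.search(f.read()) is not None
```

A caller searching many files would probably compile the pattern once and pass the compiled object around instead of recompiling per file.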

Visitor: Walking Directories “++”

Laziness is the mother of many a framework. Armed with the portable search_all script from Example 6-17, I was able to better pinpoint files to be edited every time I changed the book examples tree content or structure. At least initially, in one window I ran search_all to pick out suspicious files and edited each along the way by hand in another window.

Pretty soon, though, this became tedious, too. Manually typing filenames into editor commands is no fun, especially when the number of files to edit is large. Since I occasionally have better things to do than manually start dozens of text editor sessions, I started looking for a way to automatically run an editor on each suspicious file.

Unfortunately, search_all simply prints results to the screen. Although that text could be intercepted with os.popen and parsed by another program, a more direct approach that spawns edit sessions during the search may be simpler. That would require major changes to the tree search script as currently coded, though, and make it useful for just one specific purpose. At this point, three thoughts came to mind:

Redundancy: After writing a few directory walking utilities, it became clear that I was rewriting the same sort of code over and over again. Traversals could be even further simplified by wrapping common details for reuse. Although the os.walk tool avoids having to write recursive functions, its model tends to foster redundant operations and code (e.g., directory name joins, tracing prints).
Extensibility: Past experience informed me that it would be better in the long run to add features to a general directory searcher as external components, rather than changing the original script itself. Because editing files was just one possible extension (what about automating text replacements, too?), a more general, customizable, and reusable approach seemed the way to go. Although os.walk is straightforward to use, its nested loop-based structure doesn’t quite lend itself to customization the way a class can.
Encapsulation: Based on past experience, I also knew that it’s a generally good idea to insulate programs from implementation details as much as possible. While os.walk hides the details of recursive traversal, it still imposes a very specific interface on its clients, which is prone to change over time. Indeed it has—as I’ll explain further at the end of this section, one of Python’s tree walkers was removed altogether in 3.X, instantly breaking code that relied upon it. It would be better to hide such dependencies behind a more neutral interface, so that clients won’t break as our needs change.

Of course, if you’ve studied Python in any depth, you know that all these goals point to using an object-oriented framework for traversals and searching. Example 6-18 is a concrete realization of these goals. It exports a general FileVisitor class that mostly just wraps os.walk for easier use and extension, as well as a generic SearchVisitor class that generalizes the notion of directory searches.

By itself, SearchVisitor simply does what search_all did, but it also opens up the search process to customization—bits of its behavior can be modified by overloading its methods in subclasses. Moreover, its core search logic can be reused everywhere we need to search. Simply define a subclass that adds extensions for a specific task. The same goes for FileVisitor—by redefining its methods and using its attributes, we can tap into tree search using OOP coding techniques. As is usual in programming, once you repeat tactical tasks often enough, they tend to inspire this kind of strategic thinking.

Example 6-18. PP4E\Tools\visitor.py

"""
####################################################################################
Test: "python ...\Tools\visitor.py testmask dir [string]".  Uses classes and
subclasses to wrap some of the details of os.walk call usage to walk and search;
testmask is an integer bitmask with 1 bit per available self-test; see also:
visitor_*.py subclasses use cases; frameworks should generally use __X pseudo
private names, but all names here are exported for use in subclasses and clients;
redefine reset to support multiple independent walks that require subclass updates;
####################################################################################
"""

import os, sys

class FileVisitor:
    """
    Visits all nondirectory files below startDir (default '.'),
    override visit* methods to provide custom file/dir handlers;
    context arg/attribute is optional subclass-specific state;
    trace switch: 0 is silent, 1 is directories, 2 adds files
    """
    def __init__(self, context=None, trace=2):
        self.fcount   = 0
        self.dcount   = 0
        self.context  = context
        self.trace    = trace

    def run(self, startDir=os.curdir, reset=True):
        if reset: self.reset()
        for (thisDir, dirsHere, filesHere) in os.walk(startDir):
            self.visitdir(thisDir)
            for fname in filesHere:                          # for non-dir files
                fpath = os.path.join(thisDir, fname)         # fnames have no path
                self.visitfile(fpath)

    def reset(self):                                         # to reuse walker
        self.fcount = self.dcount = 0                        # for independent walks

    def visitdir(self, dirpath):                             # called for each dir
        self.dcount += 1                                     # override or extend me
        if self.trace > 0: print(dirpath, '...')

    def visitfile(self, filepath):                           # called for each file
        self.fcount += 1                                     # override or extend me
        if self.trace > 1: print(self.fcount, '=>', filepath)


class SearchVisitor(FileVisitor):
    """
    Search files at and below startDir for a string;
    subclass: redefine visitmatch, extension lists, candidate as needed;
    subclasses can use testexts to specify file types to search (but can
    also redefine candidate to use mimetypes for text content: see ahead)
    """

    skipexts = []
    testexts = ['.txt', '.py', '.pyw', '.html', '.c', '.h']  # search these exts
   #skipexts = ['.gif', '.jpg', '.pyc', '.o', '.a', '.exe']  # or skip these exts

    def __init__(self, searchkey, trace=2):
        FileVisitor.__init__(self, searchkey, trace)
        self.scount = 0

    def reset(self):                                         # on independent walks
        FileVisitor.reset(self)                              # also clear fcount, dcount
        self.scount = 0

    def candidate(self, fname):                              # redef for mimetypes
        ext = os.path.splitext(fname)[1]
        if self.testexts:
            return ext in self.testexts                      # in test list
        else:                                                # or not in skip list
            return ext not in self.skipexts

    def visitfile(self, fname):                              # test for a match
        FileVisitor.visitfile(self, fname)
        if not self.candidate(fname):
            if self.trace > 0: print('Skipping', fname)
        else:
            text = open(fname).read()                        # 'rb' if undecodable
            if self.context in text:                         # or text.find() != -1
                self.visitmatch(fname, text)
                self.scount += 1

    def visitmatch(self, fname, text):                       # process a match
        print('%s has %s' % (fname, self.context))           # override me lower


if __name__ == '__main__':
    # self-test logic
    dolist   = 1
    dosearch = 2    # 3=do list and search
    donext   = 4    # when next test added

    def selftest(testmask):
        if testmask & dolist:
           visitor = FileVisitor(trace=2)
           visitor.run(sys.argv[2])
           print('Visited %d files and %d dirs' % (visitor.fcount, visitor.dcount))

        if testmask & dosearch:
           visitor = SearchVisitor(sys.argv[3], trace=0)
           visitor.run(sys.argv[2])
           print('Found in %d files, visited %d' % (visitor.scount, visitor.fcount))

    selftest(int(sys.argv[1]))    # e.g., 3 = dolist | dosearch

This module primarily serves to export classes for external use, but it does something useful when run standalone, too. If you invoke it as a script with a test mask of 1 and a root directory name, it makes and runs a FileVisitor object and prints an exhaustive listing of every file and directory at and below the root:

C:\...\PP4E\Tools> visitor.py 1 C:\temp\PP3E\Examples
C:\temp\PP3E\Examples ...
1 => C:\temp\PP3E\Examples\README-root.txt
C:\temp\PP3E\Examples\PP3E ...
2 => C:\temp\PP3E\Examples\PP3E\echoEnvironment.pyw
3 => C:\temp\PP3E\Examples\PP3E\LaunchBrowser.pyw
4 => C:\temp\PP3E\Examples\PP3E\Launcher.py
5 => C:\temp\PP3E\Examples\PP3E\Launcher.pyc
...more output omitted (pipe into more or a file)...
1424 => C:\temp\PP3E\Examples\PP3E\System\Threads\thread-count.py
1425 => C:\temp\PP3E\Examples\PP3E\System\Threads\thread1.py
C:\temp\PP3E\Examples\PP3E\TempParts ...
1426 => C:\temp\PP3E\Examples\PP3E\TempParts\109_0237.JPG
1427 => C:\temp\PP3E\Examples\PP3E\TempParts\lawnlake1-jan-03.jpg
1428 => C:\temp\PP3E\Examples\PP3E\TempParts\part-001.txt
1429 => C:\temp\PP3E\Examples\PP3E\TempParts\part-002.html
Visited 1429 files and 186 dirs

If you instead invoke this script with a 2 as its first command-line argument, it makes and runs a SearchVisitor object using the third argument as the search key. This form is similar to running the search_all.py script we met earlier, but it simply reports each matching file without pausing:

C:\...\PP4E\Tools> visitor.py 2 C:\temp\PP3E\Examples mimetypes
C:\temp\PP3E\Examples\PP3E\extras\LosAlamosAdvancedClass\day1-system\data.txt ha
s mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailParser.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailSender.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat_modular.py has mimet
ypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\ftptools.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\uploadflat.py has mimetypes
C:\temp\PP3E\Examples\PP3E\System\Media\playfile.py has mimetypes
Found in 8 files, visited 1429

Technically, passing this script a first argument of 3 runs both a FileVisitor and a SearchVisitor (two separate traversals are performed). The first argument is really used as a bit mask to select one or more supported self-tests; if a test’s bit is on in the binary value of the argument, the test will be run. Because 3 is 011 in binary, it selects both a search (010) and a listing (001). In a more user-friendly system, we might want to be more symbolic about that (e.g., check for -search and -list arguments), but bit masks work just as well for this script’s scope.
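
The bit mask test selection described here can be sketched in isolation as follows. The function names, and the -list and -search flag spellings in the symbolic front end, are hypothetical; only the bit values 1 and 2 come from the self-test code in Example 6-18.

```python
DOLIST, DOSEARCH = 1, 2                  # one bit per self-test, as in visitor.py

def selected_tests(testmask):
    # return the names of the tests whose bits are on in testmask
    names = []
    if testmask & DOLIST:
        names.append('list')
    if testmask & DOSEARCH:
        names.append('search')
    return names

def mask_from_flags(args):
    # hypothetical symbolic front end: map flag names to an equivalent mask
    flags = {'-list': DOLIST, '-search': DOSEARCH}
    mask = 0
    for arg in args:
        mask |= flags[arg]               # KeyError on an unknown flag
    return mask
```

With this, selected_tests(3) names both tests, and mask_from_flags(['-list', '-search']) produces the same mask of 3 from symbolic arguments.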

As usual, this module can also be used interactively. The following is one way to determine how many files and directories you have in specific directories; the last command walks over your entire drive (after a generally noticeable delay!). See also the “biggest file” example at the start of this chapter for issues such as potential repeat visits not handled by this walker:

C:\...\PP4E\Tools> python
>>> from visitor import FileVisitor
>>> V = FileVisitor(trace=0)
>>> V.run(r'C:\temp\PP3E\Examples')
>>> V.dcount, V.fcount
(186, 1429)

>>> V.run('..')                        # independent walk (reset counts)
>>> V.dcount, V.fcount
(19, 181)

>>> V.run('..', reset=False)           # accumulative walk (keep counts)
>>> V.dcount, V.fcount
(38, 362)

>>> V = FileVisitor(trace=0)           # new independent walker (own counts)
>>> V.run('C:\\')                      # entire drive: try '/' on Unix-en
>>> V.dcount, V.fcount
(24992, 198585)

Although the visitor module is useful by itself for listing and searching trees, it was really designed to be extended. In the rest of this section, let’s quickly step through a handful of visitor clients which add more specific tree operations, using normal OO customization techniques.
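
As one small customization of this kind, the SearchVisitor docstring notes that candidate can be redefined to use mimetypes instead of extension lists. The standalone function below is a minimal sketch of such a test, assuming that anything whose guessed type is not text, or that appears to be compressed, should be skipped; the name is_text_candidate is hypothetical, and its body could serve as the body of a candidate override in a subclass.

```python
import mimetypes

def is_text_candidate(fname):
    # guess the content type from the filename alone (no file is opened)
    mimetype, encoding = mimetypes.guess_type(fname)
    if mimetype is None or encoding is not None:   # unknown type, or compressed
        return False
    return mimetype.split('/')[0] == 'text'        # e.g., 'text/plain'
```

Because the guess is based only on the name, a misnamed file would still be misclassified, which is the same limitation the extension lists have.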

Editing Files in Directory Trees (Visitor)

After genericizing tree traversals and searches, it’s easy to add automatic file editing in a brand-new, separate component. Example 6-19 defines a new EditVisitor class that simply customizes the visitmatch method of the SearchVisitor class to open a text editor on the matched file. Yes, this is the complete program—it needs to do something special only when visiting matched files, and so it needs to provide only that behavior. The rest of the traversal and search logic is unchanged and inherited.

Example 6-19. PP4E\Tools\visitor_edit.py

"""
Use: "python ...\Tools\visitor_edit.py string rootdir?".
Add auto-editor startup to SearchVisitor in an external subclass component;
Automatically pops up an editor on each file containing string as it traverses;
can also use editor='edit' or 'notepad' on Windows; to use texteditor from
later in the book, try r'python Gui\TextEditor\textEditor.py'; could also
send a search command to go to the first match on start in some editors;
"""

import os, sys
from visitor import SearchVisitor

class EditVisitor(SearchVisitor):
    """
    edit files at and below startDir having string
    """
    editor = r'C:\cygwin\bin\vim-nox.exe'  # ymmv!

    def visitmatch(self, fpathname, text):
        os.system('%s %s' % (self.editor, fpathname))

if __name__  == '__main__':
    visitor = EditVisitor(sys.argv[1])
    visitor.run('.' if len(sys.argv) < 3 else sys.argv[2])
    print('Edited %d files, visited %d' % (visitor.scount, visitor.fcount))

When we make and run an EditVisitor, a text editor is started with the os.system command-line spawn call, which usually blocks its caller until the spawned program finishes. As coded, when run on my machines, each time this script finds a matched file during the traversal, it starts up the vi text editor within the console window where the script was started; exiting the editor resumes the tree walk.
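
One caveat with building a command line by string formatting is that a matched pathname containing spaces would be split into several arguments by the shell. As a hedged alternative sketch (not what Example 6-19 does), the standard subprocess module can be given an argument list, which avoids shell quoting and also blocks until the spawned program exits. The function name run_editor and its editor_args parameter are hypothetical.

```python
import subprocess

def run_editor(editor_args, fpath):
    # editor_args is a list such as ['vim'] (a hypothetical setting);
    # passing a list means fpath arrives as one argument, spaces and all
    return subprocess.call(editor_args + [fpath])   # waits for editor to exit
```

The return value is the spawned program's exit status, which a visitmatch override could check or simply ignore.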

Let’s find and edit some files. When run as a script, we pass this program the search string as a command argument (here, the string mimetypes is the search key). The root directory passed to the run method is either the second argument or “.” (the current run directory) by default. Traversal status messages show up in the console, but each matched file now automatically pops up in a text editor along the way. In the following, the editor is started eight times—try this with an editor and tree of your own to get a better feel for how it works:

C:\...\PP4E\Tools> visitor_edit.py mimetypes C:\temp\PP3E\Examples
C:\temp\PP3E\Examples ...
1 => C:\temp\PP3E\Examples\README-root.txt
C:\temp\PP3E\Examples\PP3E ...
2 => C:\temp\PP3E\Examples\PP3E\echoEnvironment.pyw
3 => C:\temp\PP3E\Examples\PP3E\LaunchBrowser.pyw
4 => C:\temp\PP3E\Examples\PP3E\Launcher.py
5 => C:\temp\PP3E\Examples\PP3E\Launcher.pyc
Skipping C:\temp\PP3E\Examples\PP3E\Launcher.pyc
...more output omitted...
1427 => C:\temp\PP3E\Examples\PP3E\TempParts\lawnlake1-jan-03.jpg
Skipping C:\temp\PP3E\Examples\PP3E\TempParts\lawnlake1-jan-03.jpg
1428 => C:\temp\PP3E\Examples\PP3E\TempParts\part-001.txt
1429 => C:\temp\PP3E\Examples\PP3E\TempParts\part-002.html
Edited 8 files, visited 1429

This, finally, is the exact tool I was looking for to simplify global book examples tree maintenance. After major changes to things such as shared modules and file and directory names, I run this script on the examples root directory with an appropriate search string and edit any files it pops up as needed. I still need to change files by hand in the editor, but that’s often safer than blind global replacements.

Global Replacements in Directory Trees (Visitor)

But since I brought it up: given a general tree traversal class, it’s easy to code a global search-and-replace subclass, too. The ReplaceVisitor class in Example 6-20 is a SearchVisitor subclass that customizes the visitmatch method to globally replace any appearances of one string with another, in all text files at and below a root directory. It also collects the names of all files that were changed in a list just in case you wish to go through and verify the automatic edits applied (a text editor could be automatically popped up on each changed file, for instance).

Example 6-20. PP4E\Tools\visitor_replace.py

"""
Use: "python ...\Tools\visitor_replace.py rootdir fromStr toStr".
Does global search-and-replace in all files in a directory tree: replaces
fromStr with toStr in all text files; this is powerful but dangerous!!
visitor_edit.py runs an editor for you to verify and make changes, and so
is safer; use visitor_collect.py to simply collect matched files list;
listonly mode here is similar to both SearchVisitor and CollectVisitor;
"""

import sys
from visitor import SearchVisitor

class ReplaceVisitor(SearchVisitor):
    """
    Change fromStr to toStr in files at and below startDir;
    files changed available in obj.changed list after a run
    """
    def __init__(self, fromStr, toStr, listOnly=False, trace=0):
        self.changed  = []
        self.toStr    = toStr
        self.listOnly = listOnly
        SearchVisitor.__init__(self, fromStr, trace)

    def visitmatch(self, fname, text):
        self.changed.append(fname)
        if not self.listOnly:
            fromStr, toStr = self.context, self.toStr
            text = text.replace(fromStr, toStr)
            open(fname, 'w').write(text)

if __name__  == '__main__':
    listonly = input('List only?') == 'y'
    visitor  = ReplaceVisitor(sys.argv[2], sys.argv[3], listonly)
    if listonly or input('Proceed with changes?') == 'y':
        visitor.run(startDir=sys.argv[1])
        action = 'Changed' if not listonly else 'Found'
        print('Visited %d files'  % visitor.fcount)
        print(action, '%d files:' % len(visitor.changed))
        for fname in visitor.changed: print(fname)

To run this script over a directory tree, run the following sort of command line with appropriate “from” and “to” strings. On my shockingly underpowered netbook machine, doing this on a 1429-file tree and changing 101 files along the way takes roughly three seconds of real clock time when the system isn’t particularly busy.

C:\...\PP4E\Tools> visitor_replace.py C:\temp\PP3E\Examples PP3E PP4E
List only?y
Visited 1429 files
Found 101 files:
C:\temp\PP3E\Examples\README-root.txt
C:\temp\PP3E\Examples\PP3E\echoEnvironment.pyw
C:\temp\PP3E\Examples\PP3E\Launcher.py
...more matching filenames omitted...

C:\...\PP4E\Tools> visitor_replace.py C:\temp\PP3E\Examples PP3E PP4E
List only?n
Proceed with changes?y
Visited 1429 files
Changed 101 files:
C:\temp\PP3E\Examples\README-root.txt
C:\temp\PP3E\Examples\PP3E\echoEnvironment.pyw
C:\temp\PP3E\Examples\PP3E\Launcher.py
...more changed filenames omitted...

C:\...\PP4E\Tools> visitor_replace.py C:\temp\PP3E\Examples PP3E PP4E
List only?n
Proceed with changes?y
Visited 1429 files
Changed 0 files:

Naturally, we can also check our work by running the visitor script (and SearchVisitor superclass):

C:\...\PP4E\Tools> visitor.py 2 C:\temp\PP3E\Examples PP3E
Found in 0 files, visited 1429

C:\...\PP4E\Tools> visitor.py 2 C:\temp\PP3E\Examples PP4E
C:\temp\PP3E\Examples\README-root.txt has PP4E
C:\temp\PP3E\Examples\PP3E\echoEnvironment.pyw has PP4E
C:\temp\PP3E\Examples\PP3E\Launcher.py has PP4E
...more matching filenames omitted...
Found in 101 files, visited 1429

This is both wildly powerful and dangerous. If the string to be replaced can show up in places you didn’t anticipate, you might just ruin an entire tree of files by running the ReplaceVisitor object defined here. On the other hand, if the string is something very specific, this object can obviate the need to manually edit suspicious files. For instance, website addresses in HTML files are likely too specific to show up in other places by chance.
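
One way to soften the danger, sketched here under stated assumptions, is to save a copy of each file before rewriting it. This is not part of ReplaceVisitor as listed; the function name replace_with_backup and the '.bak' suffix are hypothetical choices for illustration, and a visitmatch override could call something along these lines.

```python
import shutil

def replace_with_backup(fname, fromStr, toStr, backupext='.bak'):
    # hypothetical helper: returns True if the file was changed
    with open(fname) as f:
        text = f.read()
    if fromStr not in text:
        return False
    shutil.copy2(fname, fname + backupext)      # keep the original alongside
    with open(fname, 'w') as f:
        f.write(text.replace(fromStr, toStr))
    return True
```

The backups double the number of changed files in the tree, so they would need to be cleaned up once the edits have been verified.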

Counting Source Code Lines (Visitor)

The two preceding visitor module clients were both search-oriented, but it’s just as easy to extend the basic walker class for more specific goals. Example 6-21, for instance, extends FileVisitor to count the number of lines in program source code files of various types throughout an entire tree. The effect is much like calling the visitfile method of this class for each filename returned by the find tool we wrote earlier in this chapter, but the OO structure here is arguably more flexible and extensible.

Example 6-21. PP4E\Tools\visitor_sloc.py

"""
Count lines among all program source files in a tree named on the command
line, and report totals grouped by file types (extension).  A simple SLOC
(source lines of code) metric: skip blank and comment lines if desired.
"""

import sys, pprint, os
from visitor import FileVisitor

class LinesByType(FileVisitor):
    srcExts = [] # define in subclass

    def __init__(self, trace=1):
        FileVisitor.__init__(self, trace=trace)
        self.srcLines = self.srcFiles = 0
        self.extSums = {ext: dict(files=0, lines=0) for ext in self.srcExts}

    def visitsource(self, fpath, ext):
        if self.trace > 0: print(os.path.basename(fpath))
        lines = len(open(fpath, 'rb').readlines())
        self.srcFiles += 1
        self.srcLines += lines
        self.extSums[ext]['files'] += 1
        self.extSums[ext]['lines'] += lines

    def visitfile(self, filepath):
        FileVisitor.visitfile(self, filepath)
        for ext in self.srcExts:
            if filepath.endswith(ext):
                self.visitsource(filepath, ext)
                break

class PyLines(LinesByType):
    srcExts = ['.py', '.pyw']   # just python files

class SourceLines(LinesByType):
    srcExts = ['.py', '.pyw', '.cgi', '.html', '.c', '.cxx', '.h', '.i']

if __name__ == '__main__':
    walker = SourceLines()
    walker.run(sys.argv[1])
    print('Visited %d files and %d dirs' % (walker.fcount, walker.dcount))
    print('-'*80)
    print('Source files=>%d, lines=>%d'  % (walker.srcFiles, walker.srcLines))
    print('By Types:')
    pprint.pprint(walker.extSums)

    print('\nCheck sums:', end=' ')
    print(sum(x['lines'] for x in walker.extSums.values()), end=' ')
    print(sum(x['files'] for x in walker.extSums.values()))

    print('\nPython only walk:')
    walker = PyLines(trace=0)
    walker.run(sys.argv[1])
    pprint.pprint(walker.extSums)

When run as a script, we get trace messages during the walk (omitted here to save space), and a report with line counts grouped by file type. Run this on trees of your own to watch its progress; my tree has 907 source files and 48K source lines, including 783 files and 34K lines of “.py” Python code:

C:\...\PP4E\Tools> visitor_sloc.py C:\temp\PP3E\Examples
Visited 1429 files and 186 dirs
--------------------------------------------------------------------------------
Source files=>907, lines=>48047
By Types:
{'.c': {'files': 45, 'lines': 7370},
 '.cgi': {'files': 5, 'lines': 122},
 '.cxx': {'files': 4, 'lines': 2278},
 '.h': {'files': 7, 'lines': 297},
 '.html': {'files': 48, 'lines': 2830},
 '.i': {'files': 4, 'lines': 49},
 '.py': {'files': 783, 'lines': 34601},
 '.pyw': {'files': 11, 'lines': 500}}

Check sums: 48047 907

Python only walk:
{'.py': {'files': 783, 'lines': 34601}, '.pyw': {'files': 11, 'lines': 500}}
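
For comparison, the same tallies can be gathered without the visitor classes. The following is a minimal sketch rather than code from the book's examples: the function name count_lines and its arguments are invented for illustration, and it assumes the same rules as visitsource (the first matching extension wins, and lines are counted in binary mode).

```python
import os
from collections import defaultdict

def count_lines(root, exts=('.py', '.pyw')):
    """Tally files and lines per extension under root (hypothetical helper)."""
    sums = defaultdict(lambda: {'files': 0, 'lines': 0})
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            for ext in exts:
                if name.endswith(ext):              # first matching extension wins
                    with open(os.path.join(dirpath, name), 'rb') as f:
                        lines = sum(1 for _ in f)   # binary mode: no decoding errors
                    sums[ext]['files'] += 1
                    sums[ext]['lines'] += lines
                    break
    return dict(sums)
```

Calling count_lines on a tree of your own returns a dictionary shaped like the extSums report above, although extensions with no matching files are simply absent instead of showing zero counts.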

Recoding Copies with Classes (Visitor)

Let’s peek at one more visitor use case. When I first wrote the cpall.py script earlier in this chapter, I couldn’t see a way that the visitor class hierarchy we met earlier would help. Two directories needed to be traversed in parallel (the original and the copy), and visitor is based on walking just one tree with os.walk. There seemed no easy way to keep track of where the script was in the copy directory.

The trick I eventually stumbled onto is not to keep track at all. Instead, the script in Example 6-22 simply replaces the “from” directory path string with the “to” directory path string, at the front of all directory names and pathnames passed in from os.walk. The results of the string replacements are the paths to which the original files and directories are to be copied.
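
The path mapping itself is small enough to try in isolation. This is a sketch under the same assumption the script makes, namely that every path handed in begins with the "from" directory string; the helper name map_path is invented here and is not part of the book's examples.

```python
import os

def map_path(from_dir, to_dir, path):
    """Swap the leading from_dir of path for to_dir (hypothetical helper).

    Assumes path starts with from_dir, as the paths produced by
    os.walk(from_dir) do; the + 1 also skips the separator that follows it.
    """
    return os.path.join(to_dir, path[len(from_dir) + 1:])

src = os.path.join('examples', 'PP3E', 'Launcher.py')
print(map_path('examples', 'copytemp', src))   # same tail, now under copytemp
```

Note that the slice is purely positional: a "from" directory given with a trailing separator would cause one character too many to be dropped. If that matters for your use, os.path.relpath(path, from_dir) is one more forgiving way to compute the tail.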

Example 6-22. PP4E\Tools\visitor_cpall.py

"""
Use: "python .../Tools/visitor_cpall.py fromDir toDir trace?"
Like System/Filetools/cpall.py, but with the visitor classes and os.walk;
does string replacement of fromDir with toDir at the front of all the names
that the walker passes in; assumes that the toDir does not exist initially;
"""

import os
from visitor import FileVisitor                       # visitor is in '.'
from PP4E.System.Filetools.cpall import copyfile      # PP4E is in a dir on path

class CpallVisitor(FileVisitor):
    def __init__(self, fromDir, toDir, trace=True):
        self.fromDirLen = len(fromDir) + 1
        self.toDir      = toDir
        FileVisitor.__init__(self, trace=trace)

    def visitdir(self, dirpath):
        toPath = os.path.join(self.toDir, dirpath[self.fromDirLen:])
        if self.trace: print('d', dirpath, '=>', toPath)
        os.mkdir(toPath)
        self.dcount += 1

    def visitfile(self, filepath):
        toPath = os.path.join(self.toDir, filepath[self.fromDirLen:])
        if self.trace: print('f', filepath, '=>', toPath)
        copyfile(filepath, toPath)
        self.fcount += 1

if __name__ == '__main__':
    import sys, time
    fromDir, toDir = sys.argv[1:3]
    trace = len(sys.argv) > 3
    print('Copying...')
    start = time.perf_counter()                     # time.clock was removed in Python 3.8
    walker = CpallVisitor(fromDir, toDir, trace)
    walker.run(startDir=fromDir)
    print('Copied', walker.fcount, 'files,', walker.dcount, 'directories', end=' ')
    print('in', time.perf_counter() - start, 'seconds')

This version accomplishes roughly the same goal as the original, but it has made a few assumptions to keep the code simple. The “to” directory is assumed not to exist initially, and exceptions are not ignored along the way. Here it is copying the book examples tree from the prior edition again on Windows:

C:\...\PP4E\Tools> set PYTHONPATH
PYTHONPATH=C:\Users\Mark\Stuff\Books\4E\PP4E\dev\Examples

C:\...\PP4E\Tools> rmdir /S copytemp
copytemp, Are you sure (Y/N)? y

C:\...\PP4E\Tools> visitor_cpall.py C:\temp\PP3E\Examples copytemp
Copying...
Copied 1429 files, 186 directories in 11.1722033777 seconds

C:\...\PP4E\Tools> fc /B copytemp\PP3E\Launcher.py
                         C:\temp\PP3E\Examples\PP3E\Launcher.py
Comparing files COPYTEMP\PP3E\Launcher.py and C:\TEMP\PP3E\EXAMPLES\PP3E\LAUNCHER.PY
FC: no differences encountered

Despite the extra string slicing going on, this version seems to run just as fast as the original (the actual difference can be chalked up to system load variations). For tracing purposes, this version also prints all the “from” and “to” copy paths during the traversal if you pass in a third argument on the command line:

C:\...\PP4E\Tools> rmdir /S copytemp
copytemp, Are you sure (Y/N)? y

C:\...\PP4E\Tools> visitor_cpall.py C:\temp\PP3E\Examples copytemp 1
Copying...
d C:\temp\PP3E\Examples => copytemp\
f C:\temp\PP3E\Examples\README-root.txt => copytemp\README-root.txt
d C:\temp\PP3E\Examples\PP3E => copytemp\PP3E
...more lines omitted: try this on your own for the full output...

Other Visitor Examples (External)

Although the visitor is widely applicable, we don’t have space to explore additional subclasses in this book. For more example clients and use cases, see the following examples in the book’s examples distribution package described in the Preface:

Tools\visitor_collect.py collects and/or prints files containing a search string
Tools\visitor_poundbang.py replaces directory paths in “#!” lines at the top of Unix scripts
Tools\visitor_cleanpyc.py is a visitor-based recoding of our earlier bytecode cleanup scripts
Tools\visitor_bigpy.py is a visitor-based version of the “biggest file” example at the start of this chapter

Most of these are almost as trivial as the visitor_edit.py code in Example 6-19, because the visitor framework handles walking details automatically. The collector, for instance, simply appends to a list as a search visitor detects matched files and allows the default list of text filename extensions in the search visitor to be overridden per instance—it’s roughly like a combination of find and grep on Unix:

>>> from visitor_collect import CollectVisitor
>>> V = CollectVisitor('mimetypes', testexts=['.py', '.pyw'], trace=0)
>>> V.run(r'C:\temp\PP3E\Examples')
>>> for name in V.matches: print(name)        # .py and .pyw files with 'mimetypes'
...
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailParser.py
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailSender.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat_modular.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\ftptools.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\uploadflat.py
C:\temp\PP3E\Examples\PP3E\System\Media\playfile.py

C:\...\PP4E\Tools> visitor_collect.py mimetypes C:\temp\PP3E\Examples   # as script

The core logic of the biggest-file visitor is similarly straightforward, and harkens back to the start of this chapter:

class BigPy(FileVisitor):
    def __init__(self, trace=0):
        FileVisitor.__init__(self, context=[], trace=trace)

    def visitfile(self, filepath):
        FileVisitor.visitfile(self, filepath)
        if filepath.endswith('.py'):
            self.context.append((os.path.getsize(filepath), filepath))

And the bytecode-removal visitor brings us back full circle, showing an additional alternative to those we met earlier in this chapter. It’s essentially the same code, but it runs os.remove on “.pyc” file visits.
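
Because visitor_cleanpyc.py is not listed in this chapter, the following is only a guess at its general shape, not the book's code. MiniVisitor is a simplified stand-in for FileVisitor so that the sketch runs on its own, and the dry_run switch is an addition made here for safety.

```python
import os

class MiniVisitor:
    """Stand-in for the book's FileVisitor (simplified, hypothetical)."""
    def __init__(self, trace=0):
        self.trace = trace
        self.fcount = self.dcount = 0

    def run(self, startdir=os.curdir):
        for dirpath, dirnames, filenames in os.walk(startdir):
            self.dcount += 1
            for name in filenames:
                self.visitfile(os.path.join(dirpath, name))

    def visitfile(self, filepath):
        self.fcount += 1
        if self.trace: print(filepath)

class CleanPyc(MiniVisitor):
    """Remove .pyc files, or just record them when dry_run is true."""
    def __init__(self, dry_run=True, trace=0):
        MiniVisitor.__init__(self, trace)
        self.dry_run = dry_run
        self.removed = []

    def visitfile(self, filepath):
        MiniVisitor.visitfile(self, filepath)
        if filepath.endswith('.pyc'):
            self.removed.append(filepath)           # remember what matched
            if not self.dry_run:
                os.remove(filepath)                 # delete only on request
```

Running CleanPyc().run(somedir) first and inspecting its removed list is a cheap way to check what a real run with dry_run=False would delete.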

In the end, while the visitor classes are really just simple wrappers for os.walk, they further automate walking chores and provide a general framework and alternative class-based structure which may seem more natural to some than simple unstructured loops. They’re also representative of how Python’s OOP support maps well to real-world structures like file systems. Although os.walk works well for one-off scripts, the better extensibility, reduced redundancy, and greater encapsulation possible with OOP can be a major asset in real work as our needs change and evolve over time.

Note

In fact, those needs have changed over time. Between the third and fourth editions of this book, the original os.path.walk call was removed in Python 3.X, and os.walk became the only automated way to perform tree walks in the standard library. Examples from the prior edition that used os.path.walk were effectively broken. By contrast, although the visitor classes used this call, too, their clients did not. Because updating the visitor classes to use os.walk internally did not alter those classes’ interfaces, visitor-based tools continued to work unchanged.

This seems a prime example of the benefits of OOP’s support for encapsulation. Although the future is never completely predictable, in practice, user-defined tools like visitor tend to give you more control over changes than standard library tools like os.walk. Trust me on that; as someone who has had to update three Python books over the last 15 years, I can say with some certainty that Python change is a constant!

Playing Media Files

We have space for just one last, quick example in this chapter, so we’ll close with a bit of fun. Did you notice how the file extensions for text and binary file types were hard-coded in the directory search scripts of the prior two sections? That approach works for the trees they were applied to, but it’s not necessarily complete or portable. It would be better if we could deduce file type from file name automatically. That’s exactly what Python’s mimetypes module can do for us. In this section, we’ll use it to build a script that attempts to launch a file based upon its media type, and in the process develop general tools for opening media portably with specific or generic players.

As we’ve seen, on Windows this task is trivial—the os.startfile call opens files per the Windows registry, a system-wide mapping of file extension types to handler programs. On other platforms, we can either run specific media handlers per media type, or fall back on a resident web browser to open the file generically using Python’s webbrowser module. Example 6-23 puts these ideas into code.

Example 6-23. PP4E\System\Media\playfile.py

#!/usr/local/bin/python
"""
##################################################################################
Try to play an arbitrary media file.  Allows for specific players instead of
always using general web browser scheme.  May not work on your system as is;
audio files use filters and command lines on Unix, and filename associations
on Windows via the start command (i.e., whatever you have on your machine to
run .au files--an audio player, or perhaps a web browser).  Configure and
extend as needed.  playknownfile assumes you know what sort of media you wish
to open, and playfile tries to determine media type automatically using Python
mimetypes module; both try to launch a web browser with Python webbrowser module
as a last resort when mimetype or platform unknown.
##################################################################################
"""

import os, sys, mimetypes, webbrowser

helpmsg = """
Sorry: can't find a media player for '%s' on your system!
Add an entry for your system to the media player dictionary
for this type of file in playfile.py, or play the file manually.
"""

def trace(*args): print(*args)   # with spaces between

##################################################################################
# player techniques: generic and otherwise: extend me
##################################################################################

class MediaTool:
    def __init__(self, runtext=''):
        self.runtext = runtext
    def run(self, mediafile, **options):            # most ignore options
        fullpath = os.path.abspath(mediafile)       # cwd may be anything
        self.open(fullpath, **options)

class Filter(MediaTool):
    def open(self, mediafile, **ignored):
        import subprocess                           # os.popen pipes are text-only in 3.X
        with open(mediafile, 'rb') as media:        # so send bytes to the tool's stdin
            subprocess.run(self.runtext, shell=True, input=media.read())

class Cmdline(MediaTool):
    def open(self, mediafile, **ignored):
        cmdline = self.runtext % mediafile          # run any cmd line
        os.system(cmdline)                          # use %s for filename

class Winstart(MediaTool):                          # use Windows registry
    def open(self, mediafile, wait=False, **other): # or os.system('start file')
        if not wait:                                # allow wait for curr media
            os.startfile(mediafile)
        else:
            os.system('start /WAIT ' + mediafile)

class Webbrowser(MediaTool):
    # file:// requires abs path
    def open(self, mediafile, **options):
        webbrowser.open_new('file://%s' % mediafile, **options)

##################################################################################
# media- and platform-specific policies: change me, or pass one in
##################################################################################

# map platform to player: change me!

audiotools = {
    'sunos5':  Filter('/usr/bin/audioplay'),             # os.popen().write()
    'linux2':  Cmdline('cat %s > /dev/audio'),           # on zaurus, at least
    'sunos4':  Filter('/usr/demo/SOUND/play'),           # yes, this is that old!
    'win32':   Winstart()                                # startfile or system
   #'win32':   Cmdline('start %s')
    }

videotools = {
    'linux2':  Cmdline('tkcVideo_c700 %s'),              # zaurus pda
    'win32':   Winstart(),                               # avoid DOS pop up
    }

imagetools = {
    'linux2':  Cmdline('zimager %s'),                    # zaurus pda
    'win32':   Winstart(),
    }

texttools = {
    'linux2':  Cmdline('vi %s'),                         # zaurus pda
    'win32':   Cmdline('notepad %s')                     # or try PyEdit?
    }

apptools = {
    'win32':   Winstart()   # doc, xls, etc: use at your own risk!
    }

# map mimetype of filenames to player tables

mimetable = {'audio':       audiotools,
             'video':       videotools,
             'image':       imagetools,
             'text':        texttools,                   # not html text: browser
             'application': apptools}

##################################################################################
# top-level interfaces
##################################################################################

def trywebbrowser(filename, helpmsg=helpmsg, **options):
    """
    try to open a file in a web browser
    last resort if unknown mimetype or platform, and for text/html
    """
    trace('trying browser', filename)
    try:
        player = Webbrowser()                            # open in local browser
        player.run(filename, **options)
    except Exception:
        print(helpmsg % filename)                        # else nothing worked

def playknownfile(filename, playertable={}, **options):
    """
    play media file of known type: uses platform-specific
    player objects, or spawns a web browser if nothing for
    this platform; accepts a media-specific player table
    """
    if sys.platform in playertable:
        playertable[sys.platform].run(filename, **options)     # specific tool
    else:
        trywebbrowser(filename, **options)                     # general scheme

def playfile(filename, mimetable=mimetable, **options):
    """
    play media file of any type: uses mimetypes to guess media
    type and map to platform-specific player tables; spawn web
    browser if text/html, media type unknown, or has no table
    """
    contenttype, encoding = mimetypes.guess_type(filename)        # check name
    if contenttype is None or encoding is not None:               # can't guess
        contenttype = '?/?'                                       # poss .txt.gz
    maintype, subtype = contenttype.split('/', 1)                 # 'image/jpeg'
    if maintype == 'text' and subtype == 'html':
        trywebbrowser(filename, **options)                        # special case
    elif maintype in mimetable:
        playknownfile(filename, mimetable[maintype], **options)   # try table
    else:
        trywebbrowser(filename, **options)                        # other types

###############################################################################
# self-test code
###############################################################################

if __name__ == '__main__':
    # media type known
    playknownfile('sousa.au', audiotools, wait=True)
    playknownfile('ora-pp3e.gif', imagetools, wait=True)
    playknownfile('ora-lp4e.jpg', imagetools)

    # media type guessed
    input('Stop players and press Enter')
    playfile('ora-lp4e.jpg')                     # image/jpeg
    playfile('ora-pp3e.gif')                     # image/gif
    playfile('priorcalendar.html')               # text/html
    playfile('lp4e-preface-preview.html')        # text/html
    playfile('lp-code-readme.txt')               # text/plain
    playfile('spam.doc')                         # app
    playfile('spreadsheet.xls')                  # app
    playfile('sousa.au', wait=True)              # audio/basic
    input('Done')                                # stay open if clicked

Although it’s generally possible to open most media files by passing their names to a web browser these days, this module provides a simple framework for launching media files with more specific tools, tailored by both media type and platform. A web browser is used only as a fallback option, if more specific tools are not available. The net result is an extendable media file player, which is as specific and portable as the customizations you provide for its tables.

We’ve seen the program launch tools employed by this script in prior chapters. The script’s main new concepts have to do with the modules it uses: the webbrowser module to open some files in a local web browser, as well as the Python mimetypes module to determine media type from file name. Since these are the heart of this code’s matter, let’s explore these briefly before we run the script.

The Python webbrowser Module

The standard library webbrowser module used by this example provides a portable interface for launching web browsers from Python scripts. It attempts to locate a suitable web browser on your local machine to open a given URL (file or web address) for display. Its interface is straightforward:

>>> import webbrowser
>>> webbrowser.open_new('file://' + fullfilename)         # use os.path.abspath()

This code will open the named file in a new web browser window using whatever browser is found on the underlying computer. If no browser can be launched, the call returns False rather than True; a webbrowser.Error exception is raised only in some cases, such as when webbrowser.get is asked for a browser that is not available. You can tailor the browsers used on your platform, and the order in which they are attempted, by using the BROWSER environment variable and register function. By default, webbrowser attempts to be automatically portable across platforms.

Use an argument string of the form “file://...” or “http://...” to open a file on the local computer or web server, respectively. In fact, you can pass in any URL that the browser understands. The following pops up Python’s home page in a new locally-running browser window, for example:

>>> webbrowser.open_new('http://www.python.org')

Among other things, this is an easy way to display HTML documents as well as media files, as demonstrated by this section’s example. For broader applicability, this module can be used both as a command-line script (Python’s -m module search path flag helps here) and as an importable tool:

C:\Users\mark\Stuff\Websites\public_html> python -m webbrowser about-pp.html
C:\Users\mark\Stuff\Websites\public_html> python -m webbrowser -n about-pp.html
C:\Users\mark\Stuff\Websites\public_html> python -m webbrowser -t about-pp.html

C:\Users\mark\Stuff\Websites\public_html> python
>>> import webbrowser
>>> webbrowser.open('about-pp.html')            # reuse, new window, new tab
True
>>> webbrowser.open_new('about-pp.html')        # file:// optional on Windows
True
>>> webbrowser.open_new_tab('about-pp.html')
True

In both modes, the difference between the three usage forms is that the first tries to reuse an already-open browser window if possible, the second tries to open a new window, and the third tries to open a new tab. In practice, though, their behavior is totally dependent on what the browser selected on your platform supports, and even on the platform in general. All three forms may behave the same.

On Windows, for example, all three simply run os.startfile by default and thus create a new tab in an existing window under Internet Explorer 8. This is also why I didn’t need the “file://” full URL prefix in the preceding listing. Technically, Internet Explorer is only run if this is what is registered on your computer for the file type being opened; if not, that file type’s handler is opened instead. Some images, for example, may open in a photo viewer instead. On other platforms, such as Unix and Mac OS X, browser behavior differs, and non-URL file names might not be opened; use “file://” for portability.
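
One way to follow the “file://” advice without hand-building the URL string is to let pathlib format it. This is a sketch, not code from the book's examples: the helper names file_url and open_local are invented here, and the file name passed in is assumed to refer to a local file.

```python
import os
import pathlib
import webbrowser

def file_url(filename):
    """Return an absolute file:// URL for a local file name (hypothetical helper)."""
    return pathlib.Path(os.path.abspath(filename)).as_uri()

def open_local(filename):
    """Ask the default browser to show a local file in a new window."""
    return webbrowser.open_new(file_url(filename))
```

Unlike plain string concatenation, as_uri also quotes characters such as spaces and uses forward slashes on Windows, so the result should be acceptable to browsers that insist on well-formed URLs.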

We’ll use this module again later in this book. For example, the PyMailGUI program in Chapter 14 will employ it as a way to display HTML-formatted email messages and attachments, as well as program help. See the Python library manual for more details. In Chapters 13 and 15, we’ll also meet a related call, urllib.request.urlopen, which fetches a web page’s text given a URL, but does not open it in a browser; it may be parsed, saved, or otherwise used.

The Python mimetypes Module

To make this media player module even more useful, we also use the Python mimetypes standard library module to automatically determine the media type from the filename. We get back a type/subtype MIME content-type string if the type can be determined or None if the guess failed:

>>> import mimetypes
>>> mimetypes.guess_type('spam.jpg')
('image/jpeg', None)

>>> mimetypes.guess_type('TheBrightSideOfLife.mp3')
('audio/mpeg', None)

>>> mimetypes.guess_type('lifeofbrian.mpg')
('video/mpeg', None)

>>> mimetypes.guess_type('lifeofbrian.xyz')       # unknown type
(None, None)

Stripping off the first part of the content-type string gives the file’s general media type, which we can use to select a generic player; the second part (subtype) can tell us if text is plain or HTML:

>>> contype, encoding = mimetypes.guess_type('spam.jpg')
>>> contype.split('/')[0]
'image'

>>> mimetypes.guess_type('spam.txt')              # subtype is 'plain'
('text/plain', None)

>>> mimetypes.guess_type('spam.html')
('text/html', None)

>>> mimetypes.guess_type('spam.html')[0].split('/')[1]
'html'

A subtle thing: the second item in the tuple returned from the mimetypes guess is an encoding type we won’t use here for opening purposes. We still have to pay attention to it, though—if it is not None, it means the file is compressed (gzip or compress), even if we receive a media content type. For example, if the filename is something like spam.gif.gz, it’s a compressed image that we don’t want to try to open directly:

>>> mimetypes.guess_type('spam.gz')              # content unknown
(None, 'gzip')

>>> mimetypes.guess_type('spam.gif.gz')          # don't play me!
('image/gif', 'gzip')

>>> mimetypes.guess_type('spam.zip')             # archives
('application/zip', None)

>>> mimetypes.guess_type('spam.doc')             # office app files
('application/msword', None)

If the filename you pass in contains a directory path, the path portion is ignored (only the extension is used). This module is even smart enough to give us a filename extension for a type—useful if we need to go the other way, and create a file name from a content type:

>>> mimetypes.guess_type(r'C:\songs\sousa.au')
('audio/basic', None)

>>> mimetypes.guess_extension('audio/basic')
'.au'

Try more calls on your own for more details. We’ll use the mimetypes module again in FTP examples in Chapter 13 to determine transfer type (text or binary), and in our email examples in Chapters 13, 14, and 16 to send, save, and open mail attachments.

In Example 6-23, we use mimetypes to select a table of platform-specific player commands for the media type of the file to be played. That is, we pick a player table for the file’s media type, and then pick a command from the player table for the platform. At both steps, we give up and run a web browser if there is nothing more specific to be done.
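
The two-step lookup just described can be tried without launching anything. The sketch below is a simplified illustration, not the code of Example 6-23: pick_player and demo_table are names invented here, and the "players" are plain labels standing in for launcher objects.

```python
import mimetypes

def pick_player(filename, platform, mimetable, fallback='browser'):
    """Return the player registered for filename's media type on platform.

    mimetable maps a main type ('audio', 'image', ...) to a dict that maps
    platform names to players; fallback is returned when either step fails.
    """
    contenttype, encoding = mimetypes.guess_type(filename)
    if contenttype is None or encoding is not None:   # unknown or compressed
        return fallback
    maintype, subtype = contenttype.split('/', 1)
    if (maintype, subtype) == ('text', 'html'):       # HTML always goes to a browser
        return fallback
    return mimetable.get(maintype, {}).get(platform, fallback)

# toy tables: the values are labels here, not launcher objects (assumption)
demo_table = {'image': {'win32': 'startfile', 'linux': 'image viewer'},
              'audio': {'win32': 'startfile'}}
```

With these toy tables, a JPEG file on 'win32' maps to 'startfile', while an MP3 file on 'linux' falls through to the browser because the audio table has no entry for that platform.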

Using mimetypes guesses for SearchVisitor

To use this module to direct the text file search scripts we wrote earlier in this chapter, simply extract the main type (the part before the slash) from the content-type string returned for a file’s name. For instance, all in the following list are considered text (except “.pyw”, which we may have to special-case if we must care):

>>> for ext in ['.txt', '.py', '.pyw', '.html', '.c', '.h', '.xml']:
...     print(ext, mimetypes.guess_type('spam' + ext))
...
.txt ('text/plain', None)
.py ('text/x-python', None)
.pyw (None, None)
.html ('text/html', None)
.c ('text/plain', None)
.h ('text/plain', None)
.xml ('text/xml', None)

We can add this technique to our earlier SearchVisitor class by redefining its candidate selection method, in order to replace its default extension lists with mimetypes guesses—yet more evidence of the power of OOP customization at work:

C:\...\PP4E\Tools> python
>>> import mimetypes
>>> from visitor import SearchVisitor             # or PP4E.Tools.visitor if not .
>>>
>>> class SearchMimeVisitor(SearchVisitor):
...     def candidate(self, fname):
...         contype, encoding = mimetypes.guess_type(fname)
...         return (contype and
...                 contype.split('/')[0] == 'text' and
...                 encoding == None)
...
>>> V = SearchMimeVisitor('mimetypes', trace=0)             # search key
>>> V.run(r'C:\temp\PP3E\Examples')                         # root dir
C:\temp\PP3E\Examples\PP3E\extras\LosAlamosAdvancedClass\day1-system\data.txt ha
s mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailParser.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailSender.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat_modular.py has mimet
ypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\ftptools.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\uploadflat.py has mimetypes
C:\temp\PP3E\Examples\PP3E\System\Media\playfile.py has mimetypes
>>> V.scount, V.fcount, V.dcount
(8, 1429, 186)

Because this is not completely accurate, though (you may need to add logic to include extensions like “.pyw” missed by the guess), and because it’s not even appropriate for all search clients (some may want to search specific kinds of text only), this scheme was not used for the original class. Using and tailoring it for your own searches is left as an optional exercise.
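
As a starting point for that exercise, here is one possible shape for the extra logic, written as a standalone function so it can be tested apart from the visitor classes. The name is_text_candidate and its extra_exts parameter are assumptions made for this sketch, not part of the book's examples.

```python
import mimetypes
import os

def is_text_candidate(fname, extra_exts=('.pyw',)):
    """True if fname looks like uncompressed text worth searching."""
    if os.path.splitext(fname)[1].lower() in extra_exts:   # fill gaps in the guess
        return True
    contype, encoding = mimetypes.guess_type(fname)
    return (contype is not None and
            contype.split('/')[0] == 'text' and
            encoding is None)
```

A SearchVisitor subclass could return is_text_candidate(fname) from its candidate method. Registering the missing extension once with mimetypes.add_type is another option, at the cost of changing the module's tables for the whole process.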

Running the Script

Now, when Example 6-23 is run from the command line, if all goes well its canned self-test code at the end opens a number of audio, image, text, and other file types located in the script’s directory, using either platform-specific players or a general web browser. On my Windows 7 laptop, GIF and HTML files open in new IE browser tabs; JPEG files in Windows Photo Viewer; plain text files in Notepad; DOC and XLS files in Microsoft Word and Excel; and audio files in Windows Media Player.

Because the programs used and their behavior may vary widely from machine to machine, though, you’re best off studying this script’s code and running it on your own computer and with your own test files to see what happens. As usual, you can also test it interactively (use the package path like this one to import from a different directory, assuming your module search path includes the PP4E root):

>>> from PP4E.System.Media.playfile import playfile
>>> playfile(r'C:\movies\mov10428.mpg')                       # video/mpeg

We’ll use the playfile module again as an imported library like this in Chapter 13 to open media files downloaded by FTP. Again, you may want to tweak this script’s tables for your players. This script also assumes the media file is located on the local machine (even though the webbrowser module supports remote files with “http://” names), and it does not currently allow different players for most different MIME subtypes (it special-cases text to handle “plain” and “html” differently, but no others). In fact, this script is really just something of a simple framework that was designed to be extended. As always, hack on; this is Python, after all.
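
One way to lift the subtype limitation noted above would be to let the table hold keys for full MIME types as well as main types, and to consult the more specific key first. This is only a sketch of that idea; find_table and the sample keys are hypothetical and do not appear in playfile.py.

```python
def find_table(contenttype, tables):
    """Look up a player table by full MIME type, then by main type.

    tables may mix keys such as 'audio/mpeg' (one subtype) and 'audio'
    (every other audio subtype); returns None when neither is present.
    """
    maintype = contenttype.split('/', 1)[0]
    if contenttype in tables:                 # most specific entry wins
        return tables[contenttype]
    return tables.get(maintype)               # else fall back on the main type
```

A None result would then play the same role as a missing main type does today: the caller falls back on the web browser.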

Automated Program Launchers (External)

Finally, some optional reading—in the examples distribution package for this book (available at sites listed in the Preface) you can find additional system-related scripts we do not have space to cover here:

PP4E\Launcher.py: contains tools used by some GUI programs later in the book to start Python programs without any environment configuration. Roughly, it sets up both the system path and module import search paths as needed to run book examples, which are inherited by spawned programs. By using this module to search for files and configure environments automatically, users can avoid (or at least postpone) having to learn the intricacies of manual environment configuration before running programs. Though there is not much new in this example from a system interfaces perspective, we’ll refer back to it later, when we explore GUI programs that use its tools, as well as those of its launchmodes cousin, which we wrote in Chapter 5.
PP4E\Launch_PyDemos.pyw and PP4E\Launch_PyGadgets_bar.pyw: use Launcher.py to start major GUI book examples without any environment configuration. Because all spawned processes inherit configurations performed by the launcher, they all run with proper search path settings. When run directly, the underlying PyDemos2.pyw and PyGadgets_bar.pyw scripts (which we’ll explore briefly at the end of Chapter 10) instead rely on the configuration settings on the underlying machine. In other words, Launcher effectively hides configuration details from the GUI interfaces by enclosing them in a configuration program layer.
PP4E\LaunchBrowser.pyw: portably locates and starts an Internet web browser program on the host machine in order to view a local file or remote web page. In prior versions, it used tools in Launcher.py to search for a reasonable browser to run. The original version of this example has now been largely superseded by the standard library’s webbrowser module, which arose after this example had been developed (reptilian minds think alike!). In this edition, LaunchBrowser simply parses command-line arguments for backward compatibility and invokes the open function in webbrowser. See this module’s help text, or PyGadgets and PyDemos in Chapter 10, for example command-line usage.

That’s the end of our system tools exploration. In the next part of this book we leave the realm of the system shell and move on to explore ways to add graphical user interfaces to our programs. Later, we’ll do the same using web-based approaches. As we continue, keep in mind that the system tools we’ve studied in this part of the book see action in a wide variety of programs. For instance, we’ll put threads to work to spawn long-running tasks in the GUI part, use both threads and processes when exploring server implementations in the Internet part, and use files and file-related system calls throughout the remainder of the book.

Whether your interfaces are command lines, multiwindow GUIs, or distributed client/server websites, Python’s system interfaces toolbox is sure to play an important part in your Python programming future.

^[18]For a related print issue, see Chapter 14’s workaround for program aborts when printing stack tracebacks to standard output from spawned programs. Unlike the problem described here, that issue does not appear to be related to Unicode characters that may be unprintable in shell windows but reflects another regression for standard output prints in general in Python 3.1, which may or may not be repaired by the time you read this text. See also the Python environment variable PYTHONIOENCODING, which can override the default encoding used for standard streams.

^[19]I should note that this background story stems from the second edition of this book, written in 2000. Some ten years later, floppies have largely gone the way of the parallel port and the dinosaur. Moreover, burning a CD or DVD is no longer as painful as it once was; there are new options today such as large flash memory cards, wireless home networks, and simple email; and naturally, my home computers’ configuration isn’t what it once was. For that matter, some of my kids are no longer kids (though they’ve retained some backward compatibility with their former selves).

^[20]It turns out that the zip, gzip, and tar commands can all be replaced with pure Python code today, too. The gzip module in the Python standard library provides tools for reading and writing compressed gzip files, usually named with a .gz filename extension. It can serve as an all-Python equivalent of the standard gzip and gunzip command-line utility programs. This built-in module uses another module called zlib that implements gzip-compatible data compression. In recent Python releases, the zipfile module can be imported to make and use ZIP format archives (zip is an archive and compression format; gzip is a compression scheme), and the tarfile module allows scripts to read and write tar archives. See the Python library manual for details.
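As a minimal sketch of the three standard library modules named in this note, the function below writes the same bytes through gzip, zipfile, and tarfile in a temporary directory and reads them back from each. The function name roundtrip, the member name notes.txt, and the archive filenames are assumptions made for this example only.

```python
import gzip
import os
import tarfile
import tempfile
import zipfile


def roundtrip(data, name="notes.txt"):
    """Write data through gzip, zipfile, and tarfile; read it back from each."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, name)
        with open(plain, "wb") as f:
            f.write(data)

        # gzip: compresses a single stream, like the gzip and gunzip commands
        gz_path = plain + ".gz"
        with gzip.open(gz_path, "wb") as f:
            f.write(data)
        with gzip.open(gz_path, "rb") as f:
            results["gzip"] = f.read()

        # zipfile: an archive of named members, compressed here with deflate
        zip_path = os.path.join(tmp, "archive.zip")
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as z:
            z.write(plain, arcname=name)
        with zipfile.ZipFile(zip_path) as z:
            results["zip"] = z.read(name)

        # tarfile: a tar archive, gzip-compressed through the "w:gz" mode
        tar_path = os.path.join(tmp, "archive.tar.gz")
        with tarfile.open(tar_path, "w:gz") as t:
            t.add(plain, arcname=name)
        with tarfile.open(tar_path, "r:gz") as t:
            results["tar"] = t.extractfile(name).read()
    return results
```

Each entry in the returned dictionary should equal the original bytes, which makes the function a quick check that all three formats preserve content.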

^[21]It happens. In fact, most people who spend any substantial amount of time in cyberspace could probably tell a horror story or two. Mine goes like this: a number of years ago, I had an account with an ISP that went completely offline for a few weeks in response to a security breach by an ex-employee. Worse, not only was personal email disabled, but queued up messages were permanently lost. If your livelihood depends on email and the Web as much as mine does, you’ll appreciate the havoc such an outage can wreak.

^[22]In fact, the act of searching files often goes by the colloquial name “grepping” among developers who have spent any substantial time in the Unix ghetto.
