Engineers love to change things. As I was writing this book, I found it almost irresistible to move and rename directories, variables, and shared modules in the book examples tree whenever I thought I’d stumbled onto a more coherent structure. That was fine early on, but as the tree became more intertwined, this became a maintenance nightmare. Things such as program directory paths and module names were hardcoded all over the place—in package import statements, program startup calls, text notes, configuration files, and more.
One way to repair these references, of course, is to edit every file in the directory by hand, searching each for information that has changed. That’s so tedious as to be utterly impossible in this book’s examples tree, though; as I wrote these words, the examples tree contained 118 directories and 1,342 files! (To count for yourself, run the command line "python PyTools/visitor.py 1" in the PP3E examples root directory.)
Clearly, I needed a way to automate updates after changes.
There is a standard way to search files for strings on Unix and Linux systems: the command-line program grep and its relatives list all lines in one or more files containing a string or string pattern.[†] Given that Unix shells expand (i.e., “glob”) filename patterns automatically, a command such as grep popen *.py will search a single directory’s Python files for the string "popen". Here’s such a command in action on Windows (I installed a commercial Unix-like fgrep program on my Windows laptop because I missed it too much there):
C:\...\PP3E\System\Filetools>fgrep popen *.py
diffall.py:# - we could also os.popen a diff (unix) or fc (dos)
dirdiff.py:# - use os.popen('ls...') or glob.glob + os.path.split
dirdiff6.py: files1 = os.popen('ls %s' % dir1).readlines( )
dirdiff6.py: files2 = os.popen('ls %s' % dir2).readlines( )
testdirdiff.py: expected = expected + os.popen(test % 'dirdiff').read( )
testdirdiff.py: output = output + os.popen(test % script).read( )
DOS has a command for searching files too: find, not to be confused with the Unix find directory walker command:
C:\...\PP3E\System\Filetools>find /N "popen" testdirdiff.py
---------- testdirdiff.py
[8] expected = expected + os.popen(test % 'dirdiff').read( )
[15] output = output + os.popen(test % script).read( )
You can do the same within a Python script by running the previously mentioned shell command with os.system or os.popen. Until recently, this could also be done by combining the (now defunct) grep module with the glob module. We met the glob module in Chapter 4; it expands a filename pattern into a list of matching filename strings (much like a Unix shell). In the past, the standard library also included a grep module, which acted like a Unix grep command: grep.grep printed lines containing a pattern string among a set of files. When used with glob, the effect was much like that of the fgrep command:
>>> from grep import grep
>>> from glob import glob
>>> grep('popen', glob('*.py'))
diffall.py: 16: # - we could also os.popen a diff (unix) or fc (dos)
dirdiff.py: 12: # - use os.popen('ls...') or glob.glob + os.path.split
dirdiff6.py: 19: files1 = os.popen('ls %s' % dir1).readlines( )
dirdiff6.py: 20: files2 = os.popen('ls %s' % dir2).readlines( )
testdirdiff.py: 8: expected = expected + os.popen(test % 'dirdiff')...
testdirdiff.py: 15: output = output + os.popen(test % script).read( )

>>> import glob, grep
>>> grep.grep('system', glob.glob('*.py'))
dirdiff.py: 16: # - on unix systems we could do something similar by
regtest.py: 18: os.system('%s < %s > %s.out 2>&1' % (program, ...
regtest.py: 23: os.system('%s < %s > %s.out 2>&1' % (program, ...
regtest.py: 24: os.system('diff %s.out %s.out.bkp > %s.diffs' ...
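Because the grep module is no longer available, its effect can be approximated in a few lines of portable code. The following is only a minimal sketch, written for a current Python 3 rather than the Python 2 used in this book's examples; the function names grep and grep_glob are hypothetical stand-ins, not the removed module's actual interface. Unlike the old grep.grep, which only printed, this version returns its matches so callers can process them further.

```python
import glob


def grep(pattern, filenames):
    """Return (filename, line number, line) for each line containing pattern."""
    matches = []
    for filename in filenames:
        try:
            # errors='replace' keeps odd bytes from aborting the search
            with open(filename, errors='replace') as f:
                for lineno, line in enumerate(f, start=1):
                    if pattern in line:
                        matches.append((filename, lineno, line.rstrip('\n')))
        except OSError:
            continue    # skip files (or directories) that cannot be opened
    return matches


def grep_glob(pattern, filepattern):
    """Search every file whose name matches filepattern, in sorted order."""
    return grep(pattern, sorted(glob.glob(filepattern)))
```

A call such as grep_glob('popen', '*.py') searches the Python files in the current directory; printing each tuple as '%s: %d: %s' gives output in roughly the format shown above.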
Unfortunately, the grep module, much like the original find module discussed at the end of Chapter 4, has been removed from the standard library in the time since I wrote this example for the second edition of this book (it was limited to printing results, and so is less general than other tools). On Unix systems, we can work around its demise by running a grep shell command from within a find shell command. For instance, the following Unix command line:
find . -name "*.py" -print -exec fgrep popen {} \;
would pinpoint lines and files at and below the current directory that mention popen. If you happen to have a Unix-like find command on every machine you will ever use, this is one way to process directories.
For instance, I used to run the script in Example 7-8 on some of my machines to remove all .pyc bytecode files in the examples tree before packaging or upgrading Pythons (old binary bytecode files are not guaranteed to be compatible with newer Python releases).
Example 7-8. PP3E\PyTools\cleanpyc.py
#########################################################################
# find and delete all "*.pyc" bytecode files at and below the directory
# where this script is run; this assumes a Unix-like find command, and
# so is very nonportable; we could instead use the Python find module,
# or just walk the directory trees with portable Python code; the find
# -exec option can apply a Python script to each file too;
#########################################################################

import os, sys

if sys.platform[:3] == 'win':
    findcmd = r'c:\stuff\bin.mks\find . -name "*.pyc" -print'
else:
    findcmd = 'find . -name "*.pyc" -print'
print findcmd

count = 0
for file in os.popen(findcmd).readlines( ):      # for all filenames
    count += 1                                   # have \n at the end
    print str(file[:-1])
    os.remove(file[:-1])

print 'Removed %d .pyc files' % count
This script uses os.popen to collect the output of a commercial package’s find program installed on one of my Windows computers, or else the standard find tool on the Linux side. It’s also completely nonportable to Windows machines that don’t have the commercial Unix-like find program installed, and that includes other computers in my house, not to mention those throughout most of the world at large.
Python scripts can reuse underlying shell tools with os.popen, but by so doing they lose much of the portability advantage of the Python language. The Unix find command is not universally available and is a complex tool by itself (in fact, too complex to cover in this book; see a Unix manpage for more details). As we saw in Chapter 4, spawning a shell command also incurs a performance hit, because it must start a new independent program on your computer.
To avoid some of the portability and performance costs of spawning an underlying find command, I eventually recoded this script to use the find utilities we met and wrote in Chapter 4. The new script is shown in Example 7-9.
Example 7-9. PP3E\PyTools\cleanpyc-py.py
##########################################################################
# find and delete all "*.pyc" bytecode files at and below the directory
# where this script is run; this uses a Python find call, and so is
# portable to most machines; run this to delete .pyc's from an old Python
# release; cd to the directory you want to clean before running;
##########################################################################

import os, sys, find            # here, gets PyTools find

count = 0
for file in find.find("*.pyc"):         # for all filenames
    count += 1
    print file
    os.remove(file)

print 'Removed %d .pyc files' % count
This works portably, and it avoids external program startup costs. But find is really just a tree searcher that doesn’t let you hook into the tree search. If you need to do something unique while traversing a directory tree, you may be better off using a more manual approach. Moreover, find must collect all names before it returns; in very large directory trees, this may introduce significant performance and memory penalties. It’s not an issue for my trees, but it could be for yours.
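One way around the collect-everything-first limitation is to produce matching names one at a time as the tree is walked. The sketch below is an illustration only: it assumes a current Python 3 and its os.walk generator rather than the PyTools find module, and the names find_files and clean_pyc, as well as the dry_run flag, are hypothetical additions of mine, not part of the book's examples.

```python
import fnmatch
import os


def find_files(pattern, startdir='.'):
    """Yield paths matching pattern at and below startdir, one at a time."""
    for dirpath, dirnames, filenames in os.walk(startdir):
        for name in fnmatch.filter(filenames, pattern):
            yield os.path.join(dirpath, name)


def clean_pyc(startdir='.', dry_run=True):
    """Report *.pyc files below startdir; delete them only if dry_run is false."""
    count = 0
    for path in find_files('*.pyc', startdir):
        print(path)
        if not dry_run:
            os.remove(path)
        count += 1
    return count
```

Because find_files is a generator, the caller sees each name as soon as it is found and can stop early, and memory use does not grow with the size of the tree. Defaulting dry_run to True is a cautious choice for a script that deletes files; pass dry_run=False to actually remove them.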
To help ease the task of performing global searches on all platforms I might ever use, I coded a Python script to do most of the work for me. Example 7-10 employs the following standard Python tools that we met in the preceding chapters:
os.path.walk, to visit files in a directory
The find string method, to search for a string in text read from a file
os.path.splitext, to skip over files with binary-type extensions
os.path.join, to portably combine a directory path and filename
os.path.isdir, to skip paths that refer to directories, not files
Because it’s pure Python code, though, it can be run the same way on both Linux and Windows. In fact, it should work on any computer where Python has been installed. Moreover, because it uses direct system calls, it will likely be faster than using os.popen to spawn a find command that spawns many grep commands.
Example 7-10. PP3E\PyTools\search_all.py
############################################################################
# Use: "python ..\..\PyTools\search_all.py string".
# search all files at and below current directory for a string; uses the
# os.path.walk interface, rather than doing a find to collect names first;
############################################################################

import os, sys
listonly = False
skipexts = ['.gif', '.exe', '.pyc', '.o', '.a']        # ignore binary files

def visitfile(fname, searchKey):                       # for each non-dir file
    global fcount, vcount                              # search for string
    print vcount+1, '=>', fname                        # skip protected files
    try:
        if not listonly:
            if os.path.splitext(fname)[1] in skipexts:
                print 'Skipping', fname
            elif open(fname).read( ).find(searchKey) != -1:
                raw_input('%s has %s' % (fname, searchKey))
                fcount += 1
    except: pass
    vcount += 1

def visitor(myData, directoryName, filesInDirectory):  # called for each dir
    for fname in filesInDirectory:                     # do non-dir files here
        fpath = os.path.join(directoryName, fname)     # fnames have no dirpath
        if not os.path.isdir(fpath):                   # myData is searchKey
            visitfile(fpath, myData)

def searcher(startdir, searchkey):
    global fcount, vcount
    fcount = vcount = 0
    os.path.walk(startdir, visitor, searchkey)

if __name__ == '__main__':
    searcher('.', sys.argv[1])
    print 'Found in %d files, visited %d' % (fcount, vcount)
This file also uses the sys.argv command-line list and the __name__ trick for running in two modes. When run standalone, the search key is passed on the command line; when imported, clients call this module’s searcher function directly. For example, to search (grep) for all appearances of the directory name “Part2” in the examples tree (an old directory that really did go away!), run a command line like this in a DOS or Unix shell:
C:\...\PP3E>python PyTools\search_all.py Part2
1 => .\autoexec.bat
2 => .\cleanall.csh
3 => .\echoEnvironment.pyw
4 => .\Launcher.py
.\Launcher.py has Part2
5 => .\Launcher.pyc
Skipping .\Launcher.pyc
6 => .\Launch_PyGadgets.py
7 => .\Launch_PyDemos.pyw
8 => .\LaunchBrowser.out.txt
.\LaunchBrowser.out.txt has Part2
9 => .\LaunchBrowser.py
.\LaunchBrowser.py has Part2
...
...more lines deleted
...
1339 => .\old_Part2\Basics\unpack2b.py
1340 => .\old_Part2\Basics\unpack3.py
1341 => .\old_Part2\Basics\__init__.py
Found in 74 files, visited 1341
The script lists each file it checks as it goes, tells you which files it is skipping (names that end in extensions listed in the variable skipexts that imply binary data), and pauses for an Enter key press each time it announces a file containing the search string (the "has Part2" lines in the output). A solution based on find could not pause this way; although trivial in this example, find doesn’t return until the entire tree traversal is finished. The search_all script works the same way when it is imported rather than run, but there is no final statistics output line (fcount and vcount live in the module and so would have to be imported to be inspected here):
>>> from PP3E.PyTools.search_all import searcher
>>> searcher('.', '-exec')             # find files with string '-exec'
1 => .\autoexec.bat
2 => .\cleanall.csh
3 => .\echoEnvironment.pyw
4 => .\Launcher.py
5 => .\Launcher.pyc
Skipping .\Launcher.pyc
6 => .\Launch_PyGadgets.py
7 => .\Launch_PyDemos.pyw
8 => .\LaunchBrowser.out.txt
9 => .\LaunchBrowser.py
10 => .\Launch_PyGadgets_bar.pyw
11 => .\makeall.csh
12 => .\package.csh
.\package.csh has -exec
...more lines deleted...
However launched, this script tracks down all references to a string in an entire directory tree—a name of a changed book examples file, object, or directory, for instance.[*]
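Example 7-10 is written for the Python of its day. As a rough illustration of the same idea on a current Python 3, where os.path.walk is no longer available, the search can be written as a generator over os.walk. This is a sketch under that assumption and not the book's script: the name search_tree and its skip_exts parameter are hypothetical, and the per-file listing and counters of the original are left out for brevity.

```python
import os

SKIP_EXTS = {'.gif', '.exe', '.pyc', '.o', '.a'}   # same list as Example 7-10


def search_tree(startdir, search_key, skip_exts=SKIP_EXTS):
    """Yield the path of each file at or below startdir containing search_key."""
    for dirpath, dirnames, filenames in os.walk(startdir):
        for name in filenames:                      # os.walk lists files only
            path = os.path.join(dirpath, name)
            if os.path.splitext(path)[1] in skip_exts:
                continue                            # ignore binary files
            try:
                with open(path, errors='replace') as f:
                    text = f.read()
            except OSError:
                continue                            # skip protected files
            if search_key in text:
                yield path
```

A loop such as "for path in search_tree('.', 'Part2'): input(path + ' has Part2')" pauses after each match much as the original does, because the generator hands back each result before the traversal continues.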
[†] In fact, the act of searching files often goes by the colloquial name “grepping” among developers who have spent any substantial time in the Unix ghetto.
[*] See the coverage of regular expressions in Chapter 21. The search_all script here searches for a simple string in each file with the string find method, but it would be trivial to extend it to search for a regular expression pattern match instead (roughly, just replace find with a call to a regular expression object’s search method). Of course, such a mutation will be much more trivial after we’ve learned how to do it. Also notice the skipexts list in Example 7-10, which attempts to list all possible binary file types: it would be more general and robust to use the mimetypes logic we met at the end of Chapter 6 in order to guess file content type from its name.
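The two extensions suggested in this note might look roughly like the following. This is a hedged sketch for a current Python 3 using the standard re and mimetypes modules; the helper names looks_like_text and text_matches are hypothetical. Note one behavioral difference that is my own design choice: files whose type cannot be guessed are treated as non-text here, whereas the skipexts list reads everything it does not name.

```python
import mimetypes
import re


def looks_like_text(filename):
    """Guess from the filename alone whether a file holds text."""
    mimetype, encoding = mimetypes.guess_type(filename)
    if mimetype is None or encoding is not None:
        return False        # unknown type, or compressed: treat as non-text
    return mimetype.startswith('text/')


def text_matches(text, pattern):
    """Return True if the regular expression pattern matches anywhere in text."""
    return re.search(pattern, text) is not None
```

In a search script, looks_like_text would take the place of the skipexts test, and text_matches would take the place of the string find call.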