The next three sections conclude this chapter by exploring a handful of additional utilities for processing directories (a.k.a. folders) on your computer with Python. They present directory copy, deletion, and comparison scripts that demonstrate system tools at work. All of these were born of necessity, are generally portable among all Python platforms, and illustrate Python development concepts along the way.
Some of these scripts do things too specialized for the visitor module’s classes we applied in earlier sections of this chapter, and so require custom solutions (e.g., we can’t remove directories we intend to walk through). Most have platform-specific equivalents too (e.g., drag-and-drop copies), but the Python utilities shown here are portable, easily customized, callable from other scripts, and surprisingly fast.
My CD writer sometimes does weird things. In fact, copies of files with odd names can be totally botched on the CD, even though other files show up in one piece. That’s not necessarily a showstopper; if just a few files are trashed in a big CD backup copy, I can always copy the offending files to floppies one at a time. Unfortunately, Windows drag-and-drop copies don’t play nicely with such a CD: the copy operation stops and exits the moment the first bad file is encountered. You get only as many files as were copied up to the error, but no more.
In fact, this is not limited to CD copies. I’ve run into similar problems when trying to back up my laptop’s hard drive to another drive—the drag-and-drop copy stops with an error as soon as it reaches a file with a name that is too long to copy (common in saved web pages). The last 45 minutes spent copying is wasted time; frustrating, to say the least!
There may be some magical Windows setting to work around this feature, but I gave up hunting for one as soon as I realized that it would be easier to code a copier in Python. The cpall.py script in Example 7-25 is one way to do it. With this script, I control what happens when bad files are found—I can skip over them with Python exception handlers, for instance. Moreover, this tool works with the same interface and effect on other platforms. It seems to me, at least, that a few minutes spent writing a portable and reusable Python script to meet a need is a better investment than looking for solutions that work on only one platform (if at all).
Example 7-25. PP3E\System\Filetools\cpall.py
############################################################################
# Usage: "python cpall.py dirFrom dirTo".
# Recursive copy of a directory tree.  Works like a "cp -r dirFrom/* dirTo"
# Unix command, and assumes that dirFrom and dirTo are both directories.
# Was written to get around fatal error messages under Windows drag-and-drop
# copies (the first bad file ends the entire copy operation immediately),
# but also allows for coding customized copy operations.  May need to
# do more file type checking on Unix: skip links, fifos, etc.
############################################################################

import os, sys
verbose = 0
dcount = fcount = 0
maxfileload = 500000
blksize = 1024 * 100

def cpfile(pathFrom, pathTo, maxfileload=maxfileload):
    """
    copy file pathFrom to pathTo, byte for byte
    """
    if os.path.getsize(pathFrom) <= maxfileload:
        bytesFrom = open(pathFrom, 'rb').read()      # read small file all at once
        open(pathTo, 'wb').write(bytesFrom)          # need b mode on Windows
    else:
        fileFrom = open(pathFrom, 'rb')              # read big files in chunks
        fileTo   = open(pathTo,   'wb')              # need b mode here too
        while 1:
            bytesFrom = fileFrom.read(blksize)       # get one block, less at end
            if not bytesFrom: break                  # empty after last chunk
            fileTo.write(bytesFrom)

def cpall(dirFrom, dirTo):
    """
    copy contents of dirFrom and below to dirTo
    """
    global dcount, fcount
    for file in os.listdir(dirFrom):                 # for files/dirs here
        pathFrom = os.path.join(dirFrom, file)
        pathTo   = os.path.join(dirTo,   file)       # extend both paths
        if not os.path.isdir(pathFrom):              # copy simple files
            try:
                if verbose > 1: print 'copying', pathFrom, 'to', pathTo
                cpfile(pathFrom, pathTo)
                fcount = fcount+1
            except:
                print 'Error copying', pathFrom, 'to', pathTo, '--skipped'
                print sys.exc_info()[0], sys.exc_info()[1]
        else:
            if verbose: print 'copying dir', pathFrom, 'to', pathTo
            try:
                os.mkdir(pathTo)                     # make new subdir
                cpall(pathFrom, pathTo)              # recur into subdirs
                dcount = dcount+1
            except:
                print 'Error creating', pathTo, '--skipped'
                print sys.exc_info()[0], sys.exc_info()[1]

def getargs():
    try:
        dirFrom, dirTo = sys.argv[1:]
    except:
        print 'Use: cpall.py dirFrom dirTo'
    else:
        if not os.path.isdir(dirFrom):
            print 'Error: dirFrom is not a directory'
        elif not os.path.exists(dirTo):
            os.mkdir(dirTo)
            print 'Note: dirTo was created'
            return (dirFrom, dirTo)
        else:
            print 'Warning: dirTo already exists'
            if dirFrom == dirTo or (hasattr(os.path, 'samefile') and
                                    os.path.samefile(dirFrom, dirTo)):
                print 'Error: dirFrom same as dirTo'
            else:
                return (dirFrom, dirTo)

if __name__ == '__main__':
    import time
    dirstuple = getargs()
    if dirstuple:
        print 'Copying...'
        start = time.time()
        cpall(*dirstuple)
        print 'Copied', fcount, 'files,', dcount, 'directories',
        print 'in', time.time() - start, 'seconds'
This script implements its own recursive tree traversal logic and keeps track of both the “from” and “to” directory paths as it goes. At every level, it copies over simple files, creates directories in the “to” path, and recurs into subdirectories with “from” and “to” paths extended by one level. There are other ways to code this task (e.g., other cpall variants in the book’s examples distribution change the working directory along the way with os.chdir calls), but extending paths on descent works well in practice.
Notice this script’s reusable cpfile function: just in case there are multigigabyte files in the tree to be copied, it uses a file’s size to decide whether it should be read all at once or in chunks (remember, the file read method without arguments loads the entire file into an in-memory string). We choose fairly large file and block sizes, because the more we read at once in Python, the faster our scripts will typically run. This is more efficient than it may sound; strings left behind by prior reads will be garbage collected and their memory reused as we go.
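To make the size test concrete, here is a minimal sketch of the same idea in Python 3 syntax (the book's listing uses Python 2). The names copy_file, MAX_FILE_LOAD, and BLOCK_SIZE are hypothetical stand-ins for cpfile, maxfileload, and blksize. Returning the number of reads is an addition made here only so the two branches are easy to tell apart.

```python
import os
import tempfile

MAX_FILE_LOAD = 500000      # mirrors cpall.py's maxfileload
BLOCK_SIZE = 1024 * 100     # mirrors cpall.py's blksize

def copy_file(path_from, path_to,
              max_file_load=MAX_FILE_LOAD, block_size=BLOCK_SIZE):
    """Copy path_from to path_to byte for byte; return how many reads were issued."""
    reads = 0
    with open(path_from, 'rb') as file_from, open(path_to, 'wb') as file_to:
        if os.path.getsize(path_from) <= max_file_load:
            file_to.write(file_from.read())          # small file: load it all at once
            reads = 1
        else:
            while True:
                chunk = file_from.read(block_size)   # one block, less at the end
                if not chunk:
                    break                            # empty bytes object: end of file
                file_to.write(chunk)
                reads += 1
    return reads

if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, 'src.bin')
        dst = os.path.join(tmp, 'dst.bin')
        with open(src, 'wb') as f:
            f.write(b'x' * 2500)
        # Tiny limits force the chunked branch for this small test file
        print(copy_file(src, dst, max_file_load=1000, block_size=1000))
```

Passing small limits, as the demo does, is a convenient way to exercise the chunked branch without creating a huge file.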
Also note that this script creates the “to” directory if needed, but it assumes that the directory is empty when a copy starts up; be sure to remove the target directory before copying a new tree to its name (more on this in the next section).
Here is a big book examples tree copy in action on Windows; pass in the name of the “from” and “to” directories to kick off the process, redirect the output to a file if there are too many error messages to read all at once (e.g., > output.txt), and run an rm shell command (or similar platform-specific tool) to delete the target directory first if needed:
C:\temp>rm -rf cpexamples

C:\temp>python %X%\system\filetools\cpall.py examples cpexamples
Note: dirTo was created
Copying...
Copied 1356 files, 118 directories in 2.41999995708 seconds

C:\temp>fc /B examples\System\Filetools\cpall.py cpexamples\System\Filetools\cpall.py
Comparing files examples\System\Filetools\cpall.py and cpexamples\System\Filetools\cpall.py
FC: no differences encountered
At the time I wrote this example in 2000, this test run copied a tree of 1,356 files and 118 directories in 2.4 seconds on my 650 MHz Windows 98 laptop (the built-in time.time call can be used to query the system time in seconds). It runs a bit slower if some other programs are open on the machine, and may run arbitrarily faster or slower for you. Still, this is at least as fast as the best drag-and-drop I’ve timed on Windows.
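The timing technique used at the bottom of the script amounts to bracketing a call with two time.time readings. The following is a small sketch of that pattern in Python 3 syntax; the helper name timed is hypothetical and does not appear in the book's examples.

```python
import time

def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds) measured with time.time."""
    start = time.time()                  # system time in seconds, as a float
    result = func(*args)
    return result, time.time() - start

if __name__ == '__main__':
    total, seconds = timed(sum, range(1000000))
    print('summed to', total, 'in', seconds, 'seconds')
```

In Python 3, time.perf_counter is generally a better clock for measuring short intervals, but time.time matches what the listing does.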
So how does this script work around bad files on a CD backup? The secret is that it catches and ignores file exceptions, and it keeps walking. To copy all the files that are good on a CD, I simply run a command line such as this one:
C:\temp>python %X%\system\filetools\cpall_visitor.py g:\PP3rdEd\examples\PP3E cpexamples
Because the CD is addressed as “G:” on my Windows machine, this is the command-line equivalent of drag-and-drop copying from an item in the CD’s top-level folder, except that the Python script will recover from errors on the CD and get the rest. On copy errors, it prints a message to standard output and continues; for big copies, you’ll probably want to redirect the script’s output to a file for later inspection.
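The skip-and-continue behavior can be sketched on its own as follows, in Python 3 syntax. This is an illustration and not the book's code: the function name copy_tree_skipping_errors is hypothetical, it substitutes the standard library's shutil.copyfile for cpfile, it catches OSError where the listing uses a bare except, and it returns the failures as a list where the listing prints them.

```python
import os
import shutil
import tempfile

def copy_tree_skipping_errors(dir_from, dir_to):
    """Copy dir_from's contents into an existing dir_to.

    Returns (files_copied, failures), where failures is a list of
    (source_path, exception) pairs for items that could not be copied.
    """
    copied, failures = 0, []
    for name in os.listdir(dir_from):
        path_from = os.path.join(dir_from, name)     # extend both paths
        path_to = os.path.join(dir_to, name)
        try:
            if os.path.isdir(path_from):
                os.mkdir(path_to)                    # make the new subdirectory
                sub_copied, sub_failures = copy_tree_skipping_errors(path_from, path_to)
                copied += sub_copied
                failures.extend(sub_failures)
            else:
                shutil.copyfile(path_from, path_to)  # stand-in for cpfile
                copied += 1
        except OSError as exc:
            failures.append((path_from, exc))        # note the failure and keep walking
    return copied, failures

if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = os.path.join(tmp, 'src'), os.path.join(tmp, 'dst')
        os.mkdir(src)
        os.mkdir(dst)
        with open(os.path.join(src, 'good.txt'), 'w') as f:
            f.write('ok')
        copied, failures = copy_tree_skipping_errors(src, dst)
        print('copied', copied, 'files;', len(failures), 'failures')
```

Collecting failures in a list makes it easy to write them to a log file after a long copy, which serves the same purpose as redirecting the script's output.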
In general, cpall can be passed any absolute directory path on your machine, even those that indicate devices such as CDs. To make this go on Linux, pass the directory where the CD is mounted (e.g., /mnt/cdrom or something similar) to address your CD drive; a device file such as /dev/cdrom is not itself a directory and must be mounted first.
When I first wrote the cpall script just discussed, I couldn’t see a way that the visitor class hierarchy we met earlier would help. Two directories needed to be traversed in parallel (the original and the copy), and visitor is based on climbing one tree with os.path.walk. There seemed no easy way to keep track of where the script was in the copy directory.
The trick I eventually stumbled onto is not to keep track at all. Instead, the script in Example 7-26 simply replaces the “from” directory path string with the “to” directory path string, at the front of all directory names and pathnames passed in from os.path.walk. The results of the string replacements are the paths to which the original files and directories are to be copied.
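The prefix replacement can be tried in isolation. The sketch below uses Python 3 syntax and the os.walk generator, because os.path.walk is not available in Python 3. The function names map_path and plan_copy are hypothetical, and the code assumes that from_dir is passed without a trailing path separator.

```python
import os
import tempfile

def map_path(path, from_dir, to_dir):
    """Replace the from_dir prefix at the front of path with to_dir."""
    from_len = len(from_dir) + 1                 # +1 skips the separator after from_dir
    return os.path.join(to_dir, path[from_len:])

def plan_copy(from_dir, to_dir):
    """Walk from_dir once and return the (source, target) pairs a copy would use."""
    pairs = []
    for dirpath, dirnames, filenames in os.walk(from_dir):
        pairs.append((dirpath, map_path(dirpath, from_dir, to_dir)))
        for name in filenames:
            filepath = os.path.join(dirpath, name)
            pairs.append((filepath, map_path(filepath, from_dir, to_dir)))
    return pairs

if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, 'examples')
        os.makedirs(os.path.join(src, 'System'))
        with open(os.path.join(src, 'System', 'more.py'), 'w') as f:
            f.write('# placeholder\n')
        for source, target in plan_copy(src, os.path.join(tmp, 'cpexamples')):
            print(source, '=>', target)
```

Only one tree is ever walked; every target path is derived from the source path by slicing, so no second traversal position needs to be tracked.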
Example 7-26. PP3E\System\Filetools\cpall_visitor.py
###########################################################
# Use: "python cpall_visitor.py fromDir toDir"
# cpall, but with the visitor classes and os.path.walk;
# the trick is to do string replacement of fromDir with
# toDir at the front of all the names walk passes in;
# assumes that the toDir does not exist initially;
###########################################################

import os
from PP3E.PyTools.visitor import FileVisitor
from cpall import cpfile, getargs
verbose = True

class CpallVisitor(FileVisitor):
    def __init__(self, fromDir, toDir):
        self.fromDirLen = len(fromDir) + 1
        self.toDir      = toDir
        FileVisitor.__init__(self)
    def visitdir(self, dirpath):
        toPath = os.path.join(self.toDir, dirpath[self.fromDirLen:])
        if verbose: print 'd', dirpath, '=>', toPath
        os.mkdir(toPath)
        self.dcount += 1
    def visitfile(self, filepath):
        toPath = os.path.join(self.toDir, filepath[self.fromDirLen:])
        if verbose: print 'f', filepath, '=>', toPath
        cpfile(filepath, toPath)
        self.fcount += 1

if __name__ == '__main__':
    import sys, time
    fromDir, toDir = sys.argv[1:3]
    if len(sys.argv) > 3: verbose = 0
    print 'Copying...'
    start = time.time()
    walker = CpallVisitor(fromDir, toDir)
    walker.run(startDir=fromDir)
    print 'Copied', walker.fcount, 'files,', walker.dcount, 'directories',
    print 'in', time.time() - start, 'seconds'
This version accomplishes roughly the same goal as the original, but it has made a few assumptions to keep code simple. The “to” directory is assumed not to exist initially, and exceptions are not ignored along the way. Here it is copying the book examples tree again on Windows:
C:\temp>rm -rf cpexamples

C:\temp>python %X%\system\filetools\cpall_visitor.py examples cpexamples -quiet
Copying...
Copied 1356 files, 119 directories in 2.09000003338 seconds

C:\temp>fc /B examples\System\Filetools\cpall.py cpexamples\System\Filetools\cpall.py
Comparing files examples\System\Filetools\cpall.py and cpexamples\System\Filetools\cpall.py
FC: no differences encountered
Despite the extra string slicing going on, this version runs just as fast as the original. For tracing purposes, this version also prints all the “from” and “to” copy paths during the traversal unless you pass in a third argument on the command line or set the script’s verbose variable to False or 0:
C:\temp>python %X%\system\filetools\cpall_visitor.py examples cpexamples
Copying...
d examples => cpexamples
f examples\autoexec.bat => cpexamples\autoexec.bat
f examples\cleanall.csh => cpexamples\cleanall.csh
...more deleted...
d examples\System => cpexamples\System
f examples\System\System.txt => cpexamples\System\System.txt
f examples\System\more.py => cpexamples\System\more.py
f examples\System\reader.py => cpexamples\System\reader.py
...more deleted...
Copied 1356 files, 119 directories in 2.31000006199 seconds