Engineers can be a paranoid sort (but you didn’t hear that from me). At least I am. It comes from decades of seeing things go terribly wrong, I suppose. When I create a CD backup of my hard drive, for instance, there’s still something a bit too magical about the process to trust the CD writer program to do the right thing. Maybe I should, but it’s tough to have a lot of faith in tools that occasionally trash files and seem to crash my Windows machine every third Tuesday of the month. When push comes to shove, it’s nice to be able to verify that data copied to a backup CD is the same as the original—or at least to spot deviations from the original—as soon as possible. If a backup is ever needed, it will be really needed.
Because data CDs are accessible as simple directory trees, we
are once again in the realm of tree walkers—to verify a backup CD, we
simply need to walk its top-level directory. If our script is general
enough, we will also be able to use it to verify other copy operations
as well—e.g., downloaded tar files, hard-drive backups, and so on. In
fact, the combination of the cpall
script we saw earlier and a general tree comparison would provide a
portable and scriptable way to copy and verify data sets.
We’ve already written a generic walker class (the visitor
module), but it won’t help us here
directly: we need to walk two directories in
parallel and inspect common files along the way. Moreover, walking
either one of the two directories won’t allow us to spot files and
directories that exist only in the other. Something more custom seems
in order here.
Before we start coding, the first thing we need to clarify is what it means to compare two directory trees. If both trees have exactly the same branch structure and depth, this problem reduces to comparing corresponding files in each tree. In general, though, the trees can have arbitrarily different shapes, depths, and so on.
More generally, the contents of a directory in one tree may have more or fewer entries than the corresponding directory in the other tree. If those differing contents are filenames, there is no corresponding file to compare with; if they are directory names, there is no corresponding branch to descend through. In fact, the only way to detect files and directories that appear in one tree but not the other is to detect differences in each level’s directory.
In other words, a tree comparison algorithm will also have to perform directory comparisons along the way. Because this is a nested and simpler operation, let’s start by coding and debugging a single-directory comparison of filenames in Example 7-29.
Example 7-29. PP3ESystemFiletoolsdirdiff.py
#!/bin/env python ############################################################################# # use: python dirdiff.py dir1-path dir2-path # compare two directories to find files that exist in one but not the other; # this version uses the os.listdir function and list difference; note that # this script only checks filenames, not file contents--see diffall.py for an # extension that does the latter by comparing .read( ) results; ############################################################################# import os, sys def reportdiffs(unique1, unique2, dir1, dir2): if not (unique1 or unique2): print 'Directory lists are identical' else: if unique1: print 'Files unique to', dir1 for file in unique1: print '...', file if unique2: print 'Files unique to', dir2 for file in unique2: print '...', file def unique(seq1, seq2): uniques = [] # return items in seq1 only for item in seq1: if item not in seq2: uniques.append(item) return uniques def comparedirs(dir1, dir2): print 'Comparing', dir1, 'to', dir2 files1 = os.listdir(dir1) files2 = os.listdir(dir2) unique1 = unique(files1, files2) unique2 = unique(files2, files1) reportdiffs(unique1, unique2, dir1, dir2) return not (unique1 or unique2) # true if no diffs def getargs( ): try: dir1, dir2 = sys.argv[1:] # 2 command-line args except: print 'Usage: dirdiff.py dir1 dir2' sys.exit(1) else: return (dir1, dir2) if _ _name_ _ == '_ _main_ _': dir1, dir2 = getargs( ) comparedirs( dir1, dir2)
Given listings of names in two directories, this script simply
picks out unique names in the first and unique names in the second,
and reports any unique names found as differences (that is, files in
one directory but not the other). Its comparedirs
function returns a true result
if no differences were found, which is useful for detecting
differences in callers.
Let’s run this script on a few directories; differences are detected and reported as names unique in either passed-in directory pathname. Notice that this is only a structural comparison that just checks names in listings, not file contents (we’ll add the latter in a moment):
C: emp>python %X%systemfiletoolsdirdiff.py examples cpexamples
Comparing examples to cpexamples Directory lists are identical C: emp>python %X%systemfiletoolsdirdiff.py
examplesPyTools cpexamplesPyTools
Comparing examplesPyTools to cpexamplesPyTools Files unique to examplesPyTools ... visitor.py C: emp>python %X%systemfiletoolsdirdiff.py
examplesSystemFiletools
cpexamplesSystemFiletools
Comparing examplesSystemFiletools to cpexamplesSystemFiletools Files unique to examplesSystemFiletools ... dirdiff2.py Files unique to cpexamplesSystemFiletools ... cpall.py
The unique
function is the heart of this script: it performs a
simple list difference operation. Here’s how it works apart from the
rest of this script’s code:
>>>L1 = [1, 3, 5, 7, 9]
>>>L2 = [2, 3, 6, 8, 9]
>>>from dirdiff
import unique
>>>unique(L1, L2)
# items in L1 but not in L2 [1, 5, 7] >>>unique(L2, L1)
# items in L2 but not in L1 [2, 6, 8]
These two lists have objects 3
and 9
in common; the rest appear only in one of the two. When applied to
directories, unique items represent tree
differences, and common items are names of
files or subdirectories that merit further comparisons or
traversals. There are other ways to code this operation; see the
dirdiff
variants on the book’s
examples distribution for a few of them.
In fact, in Python 2.4 and later, we could also use the
built-in set
object type if we
don’t care about the order in the results (we’ll use our own
functions for now to avoid requiring users to upgrade):
>>>S1 = set([1, 3, 5, 7, 9])
>>>S2 = set([2, 3, 6, 8, 9])
>>>S1 - S2
# difference: unique set([1, 5, 7]) >>>S2 - S1
set([8, 2, 6]) >>>S1 & S2
# intersection: common set([9, 3])
Now all we need is a tree walker that applies dirdiff
at each level to pick out unique
files and directories, explicitly compares the contents of files in
common, and descends through directories in common. Example 7-30 fits the
bill.
Example 7-30. PP3ESystemFiletoolsdiffall.py
############################################################################ # Usage: "python diffall.py dir1 dir2". recursive tree comparison: # report unique files that exist in only dir1 or dir2, # report files of same name in dir1 and dir2 with differing contents, # report instances of same name but different type in dir1 and dir2, # and do the same for all subdirectories of the same names in and below # dir1 and dir2; summary of diffs appears at end of output, but search # redirected output for "DIFF" and "unique" strings for further details; # new: limit reads to 1M for large files, catch same name=file/dir; ############################################################################ import os, dirdiff blocksize = 1024 * 1024 # up to 1M per read def intersect(seq1, seq2): commons = [] # items in seq1 and seq2 for item in seq1: # or use new set( ) object if item in seq2: commons.append(item) return commons def comparedirs(dir1, dir2, diffs, verbose=False): # compare filename lists print '-'*20 if not dirdiff.comparedirs(dir1, dir2): diffs.append('unique files at %s - %s' % (dir1, dir2)) print 'Comparing contents' names1 = os.listdir(dir1) names2 = os.listdir(dir2) common = intersect(names1, names2) missed = common[:] # compare contents of files in common for name in common: path1 = os.path.join(dir1, name) path2 = os.path.join(dir2, name) if os.path.isfile(path1) and os.path.isfile(path2): missed.remove(name) file1 = open(path1, 'rb') file2 = open(path2, 'rb') while True: bytes1 = file1.read(blocksize) bytes2 = file2.read(blocksize) if (not bytes1) and (not bytes2): if verbose: print name, 'matches' break if bytes1 != bytes2: diffs.append('files differ at %s - %s' % (path1, path2)) print name, 'DIFFERS' break # recur to compare directories in common for name in common: path1 = os.path.join(dir1, name) path2 = os.path.join(dir2, name) if os.path.isdir(path1) and os.path.isdir(path2): missed.remove(name) comparedirs(path1, path2, diffs, verbose) # same name but not both files or dirs? for name in missed: diffs.append('files missed at %s - %s: %s' % (dir1, dir2, name)) print name, 'DIFFERS' if _ _name_ _ == '_ _main_ _': dir1, dir2 = dirdiff. getargs( ) mydiffs = [] comparedirs(dir1, dir2, mydiffs, True) # changes mydiffs in-place print '='*40 # walk, report diffs list if not mydiffs: print 'No diffs found.' else: print 'Diffs found:', len(mydiffs) for diff in mydiffs: print '-', diff
At each directory in the tree, this script simply runs the
dirdiff
tool to detect unique
names, and then compares names in common by intersecting directory
lists. It uses recursive function calls to traverse the tree and
visits subdirectories only after comparing all the files at each
level so that the output is more coherent to read (the trace output
for subdirectories appears after that for files; it is not
intermixed).
Notice the misses
list,
added in the third edition of this book; it’s very unlikely, but not
impossible, that the same name might be a file in one directory and
a subdirectory in the other. Also notice the blocksize
variable; as in the tree copy
script we saw earlier, instead of blindly reading entire files into
memory all at once, we limit each read to grab up to 1 MB at a time,
just in case any files in the directories are too big to be loaded
into available memory. Without this limit, I ran into MemoryError
exceptions on some machines
with a prior version of this script that read both files all at
once, like this:
bytes1 = open(path1, 'rb').read( ) bytes2 = open(path2, 'rb').read( ) if bytes1 == bytes2: ...match... else: ...difference...
This code was simpler, but is less practical for very large files that can’t fit into your available memory space (consider CD and DVD image files, for example). In the new version’s loop, the file reads return what is left when there is less than 1 MB present or remaining and return empty strings at end-of-file. Files match if all blocks read are the same, and they reach end-of-file at the same time. On some platforms, you may also want to detect and skip certain kinds of special files in order to be fully general, but these were not in my trees, so they are not in my script.
Since we’ve already studied the tree-walking tools
this script employs, let’s jump right into a few example runs. When
run on identical trees, status messages scroll during the traversal,
and a No diffs found.
message
appears at the end:
C: emp>python %X%systemfiletoolsdiffall.py examples cpexamples
--------------------
Comparing examples to cpexamples
Directory lists are identical
Comparing contents
--------------------
Comparing examplesold_Part2 to cpexamplesold_Part2
Directory lists are identical
Comparing contents
--------------------
...more lines deleted...
--------------------
Comparing examplesEmbExtRegist to cpexamplesEmbExtRegist
Directory lists are identical
Comparing contents
--------------------
Comparing examplesPyTools to cpexamplesPyTools
Directory lists are identical
Comparing contents
========================================
No diffs found.
I run this with the verbose
flag passed in as False
; use
True
to watch more status
messages fly by. To show how differences are reported, we need to
generate a few. Let’s run the global search-and-replace script we
met earlier in order to change a few files scattered about one of
the trees (see—I told you global replacement could trash your
files!):
C: empexamples>python %X%PyToolsvisitor_replace.py -exec SPAM
Are you sure?y
... 1355 => .PyToolsvisitor_find_quiet1.py 1356 => .PyToolsfixeoln_one.doc.txt Visited 1356 files Changed 8 files: .package.csh .README-PP3E.txt . eadme-old-pp1E.txt . emp . emp .InternetCgi-Webfixcgi.py .SystemProcessesoutput.txt .PyToolscleanpyc.py
While we’re at it, let’s remove a few common files so that directory uniqueness differences show up on the scope too; the following three removal commands will make two directories differ (the last two commands impact the same directory in different trees):
C: emp>rm cpexamplesPyToolsvisitor.py
C: emp>rm cpexamplesSystemFiletoolsdirdiff2.py
C: emp>rm examplesSystemFiletoolscpall.py
Now, rerun the comparison walker to pick out differences and
redirect its output report to a file for easy inspection. The
following lists just the parts of the output report that identify
differences. In typical use, I inspect the summary at the bottom of
the report first, and then search for the strings "DIFF"
and "unique"
in the report’s text if I need
more information about the differences summarized; this could be
more user-friendly, but it does the job for me:
C: emp>python %X%systemfiletoolsdiffall.py examples cpexamples > diffs
C: emp>type diffs
-------------------- Comparing examples to cpexamples Directory lists are identical Comparing contents package.csh DIFFERS README-PP3E.txt DIFFERS readme-old-pp1E.txt DIFFERS temp DIFFERS remp DIFFERS -------------------- Comparing examplesold_Part2 to cpexamplesold_Part2 Directory lists are identical Comparing contents -------------------- ... -------------------- Comparing examplesInternetCgi-Web to cpexamplesInternetCgi-Web Directory lists are identical Comparing contents fixcgi.py DIFFERS -------------------- ... -------------------- Comparing examplesSystemFiletools to cpexamplesSystemFiletools Files unique to examplesSystemFiletools ... dirdiff2.py Files unique to cpexamplesSystemFiletools ... cpall.py Comparing contents -------------------- ... -------------------- Comparing examplesSystemProcesses to cpexamplesSystemProcesses Directory lists are identical Comparing contents output.txt DIFFERS -------------------- ... -------------------- Comparing examplesPyTools to cpexamplesPyTools Files unique to examplesPyTools ... visitor.py Comparing contents cleanpyc.py DIFFERS ======================================== Diffs found: 10 - files differ at examplespackage.csh - cpexamplespackage.csh - files differ at examplesREADME-PP3E.txt - cpexamplesREADME-PP3E.txt - files differ at examples eadme-old-pp1E.txt - cpexamples eadme-old-pp1E.txt - files differ at examples emp - cpexamples emp - files differ at examples emp - cpexamples emp - files differ at examplesInternetCgi-Webfixcgi.py - cpexamplesInternetCgi-Webfixcgi.py - unique files at examplesSystemFiletools - cpexamplesSystemFiletools - files differ at examplesSystemProcessesoutput.txt - cpexamplesSystemProcessesoutput.txt - unique files at examplesPyTools - cpexamplesPyTools - files differ at examplesPyToolscleanpyc.py - cpexamplesPyToolscleanpyc.py
I added line breaks and tabs in a few of these output lines to
make them fit on this page, but the report is simple to understand.
Ten differences were found—the eight files we changed (trashed) with
the replacement script, and the two directories we threw out of sync
with the three rm
remove
commands.
So how does this script placate CD backup paranoia? To double-check my CD writer’s work, I run a command such as the following. I can also use a command like this to find out what has been changed since the last backup. Again, since the CD is “G:” on my machine when plugged in, I provide a path rooted there; use a root such as /dev/cdrom or /mnt/cdrom on Linux:
C: emp>python %X%systemfiletoolsdiffall.py
examples g:PP3rdEdexamplesPP3E > exdiffs091500
C: emp>more exdiffs091500
-------------------- Comparing examples to g:PP3rdEdexamplesPP3E Files unique to examples ... .cshrc Comparing contents tounix.py DIFFERS -------------------- Comparing examplesold_Part2 to g:PP3rdEdexamplesPP3Eold_Part2 Directory lists are identical Comparing contents -------------------- ...more visitor_fixeoln.py DIFFERS visitor_fixnames.py DIFFERS ======================================== Diffs found: 41 - unique files at examples - g:PP3rdEdexamplesPP3E - files differ at examples ounix.py - g:PP3rdEdexamplesPP3E ounix.py ...more
The CD spins, the script compares, and a summary of 41 differences appears at the end of the report (in this case, representing changes to the examples tree since the latest backup CD was burned). For an example of a full difference report, see the file exdiffs091500 in the book’s examples distribution. More typically, this is what turns up for most of my example backups; files with a leading “.” are not copied to the CD:
C: emp>python %X%SystemFiletoolsdiffall.py
examples g:PP3rdEdexamplesPP3E
... -------------------- Comparing examplesConfig to g:PP3rdEdexamplesPP3EConfig Files unique to examplesConfig ... .cshrc Comparing contents ======================================== Diffs found: 1 - unique files at examplesConfig - g:PP3rdEdexamplesPP3EConfig
And to be really sure, I run the following global comparison command against the true book directory to verify the entire book tree backup on CD:
C:>python %X%SystemFiletoolsdiffall.py PP3rdEd G:PP3rdEd
--------------------
Comparing PP3rdEd to G:PP3rdEd
Files unique to G:PP3rdEd
... examples.tar.gz
Comparing contents
README.txt DIFFERS
--------------------
...more
--------------------
Comparing PP3rdEdexamplesPP3EConfig to G:PP3rdEdexamplesPP3EConfig
Files unique to PP3rdEdexamplesPP3EConfig
... .cshrc
Comparing contents
--------------------
...more
--------------------
Comparing PP3rdEdchapters to G:PP3rdEdchapters
Directory lists are identical
Comparing contents
ch01-intro.doc DIFFERS
ch04-os3.doc DIFFERS
ch05-gui1.doc DIFFERS
ch06-gui2.doc DIFFERS
--------------------
...more
========================================
Diffs found: 11
- unique files at PP3rdEd - G:PP3rdEd
- files differ at PP3rdEdREADME.txt - G:PP3rdEdREADME.txt
...more
This particular run indicates that I’ve changed a README
file, four chapter files, and a bunch more since the last backup;
if run immediately after making a backup, only the
.cshrc unique file shows up on diffall
radar. This global comparison
can take a few minutes. It performs byte-for-byte comparisons of
all chapter files and screenshots, the examples tree, and more,
but it’s an accurate and complete verification. Given that this
book tree contained roughly 119 MB of data in 7,300 files and 570
directories the last time I checked, a more manual verification
procedure without Python’s help would be utterly
impossible.
After writing this script, I also started using it to verify
full backups of my laptops onto an external hard-drive device. To
do so, I run the cpall
copy
script we met earlier, and then the comparison script here to
check results and get a list of files that didn’t copy correctly.
The last time I did this, this procedure copied and compared
225,000 files and 15,000 directories in 20 GB of space—not the
sort of task that lends itself to manual labor!
Here are the magic incantations on my Windows laptop.
f:
is a partition on my
external hard drive, and you shouldn’t be surprised if each of
these commands runs for half an hour or more on currently common
hardware. A drag-and-drop copy takes at least as long, assuming it
works at all.
C:...SystemFiletools>cpall.py c: f: > f:copy-log.txt
C:...SystemFiletools>diffall.py f: c: > f:diff-log.txt
Finally, it’s worth noting that this script still only detects differences in the tree but does not give any further details about individual file differences. In fact, it simply loads and compares the binary contents of corresponding files with string comparisons. It’s a simple yes/no result.
If and when I need more details about how two reported files
actually differ, I either edit the files or run the file-comparison
command on the host platform (e.g., fc
on Windows/DOS, diff
or cmp
on Unix and Linux). That’s not a
portable solution for this last step; but for my purposes, just
finding the differences in a 1,300-file tree was much more critical
than reporting which lines differ in files flagged in the
report.
Of course, since we can always run shell commands in Python,
this last step could be automated by spawning a diff
or fc
command with os.popen
as differences are encountered
(or after the traversal, by scanning the report summary). The output
of these system calls could be displayed verbatim, or parsed for
relevant parts.
We also might try to do a bit better here by opening text
files in text mode to ignore line-terminator differences, but it’s
not clear that such differences should be ignored (what if the
caller wants to know whether line-end markers have been changed?).
For example, after downloading a web site with an FTP script we’ll
meet in Chapter 14, the
diffall
script detected a
discrepancy between the local copy of a file and the one at the
remote server. To probe further, I simply ran some interactive
Python code:
>>>a = open('lp2e-updates.html', 'rb').read( )
>>>b = open(r'C:MarkWEBSITEpublic_htmllp2e-updates.html', 'rb').read( )
>>>a == b
False
This verifies that there really is a binary difference in the
downloaded and local versions of the file; to see whether it’s
because a Unix or DOS line-end snuck into the file, try again in
text mode so that line ends are all mapped to the standard
character:
>>>a = open('lp2e-updates.html', 'r').read( )
>>>b = open(r'C:MarkWEBSITEpublic_htmllp2e-updates.html', 'r').read( )
>>>a == b
True
Sure enough; now, to find where the difference is, the following code checks character by character until the first mismatch is found (in binary mode, so we retain the difference):
>>>a = open('lp2e-updates.html', 'rb').read( )
>>>b = open(r'C:MarkWEBSITEpublic_htmllp2e-updates.html', 'rb').read( )
>>>for (i, (ac, bc)) in enumerate(zip(a, b)):
...if ac != bc:
...print i, repr(ac), repr(bc)
...break
... 37966 ' ' ' '
This means that at byte offset 37,966, there is a
in the downloaded file, but a
in the local copy. This line has a DOS
line end in one and a Unix line end in the other. To see more, print
text around the mismatch:
>>>for (i, (ac, bc)) in enumerate(zip(a, b)):
...if ac != bc:
...print i, repr(ac), repr(bc)
...print repr(a[i-20:i+20])
...print repr(b[i-20:i+20])
...break
... 37966 ' ' ' ' 're> def min(*args): tmp = list(arg' 're> def min(*args): tmp = list(args'
Apparently, I wound up with a Unix line end at one point in
the local copy and a DOS line end in the version I downloaded—the
combined effect of the text mode used by the download script itself
(which translated
to
) and years of edits on both Linux and
Windows PDAs and laptops (I probably coded this change on Linux and
copied it to my local Windows copy in binary mode). Code such as
this could be integrated into the diffall
script to make it more intelligent
about text files and difference reporting.
Because Python excels at processing files and strings, it’s
even possible to go one step further and code a Python equivalent of
the fc
and diff
commands. In fact, much of the work
has already been done; the standard library module difflib
, new as of Python 2.1, could make
this task simple. See the Python library manual for details and
usage examples.
We could also be smarter by avoiding the load and compare steps for files that differ in size, and we might use a smaller block size to reduce the script’s memory requirements. For most trees, such optimizations are unnecessary; reading multimegabyte files into strings is very fast in Python, and garbage collection reclaims the space as you go.
Since such extensions are beyond both this script’s scope and this chapter’s size limits, though, they will have to await the attention of a curious reader.
18.221.136.142