Engineers can be a paranoid sort (but you didn’t hear that from me). At least I am. It comes from decades of seeing things go terribly wrong, I suppose. When I create a CD backup of my hard drive, for instance, there’s still something a bit too magical about the process to trust the CD writer program to do the right thing. Maybe I should, but it’s tough to have a lot of faith in tools that occasionally trash files, and seem to crash my Windows 98 machine every third Tuesday of the month. When push comes to shove, it’s nice to be able to verify that data copied to a backup CD is the same as the original -- or at least spot deviations from the original -- as soon as possible. If a backup is ever needed, it will be really needed.
Because data CDs are accessible as simple directory trees, we are
once again in the realm of tree walkers -- to verify a backup CD,
we simply need to walk its top-level directory. We’ve already
written a generic walker class (the visitor
module), but it won’t help us here directly: we need to walk
two directories in parallel and inspect common
files along the way. Moreover, walking either one of the two
directories won’t allow us to spot files and directories that
only exist in the other. Something more custom seems in order here.
Before we start coding, the first thing we need to clarify is what it means to compare two directory trees. If both trees have exactly the same branch structure and depth, this problem reduces to comparing corresponding files in each tree. In general, though, the trees can have arbitrarily different shapes, depth, and so on.
More generally, the contents of a directory in one tree may have more or fewer entries than the corresponding directory in the other tree. If those differing contents are filenames, there is no corresponding file to compare; if they are directory names, there is no corresponding branch to descend through. In fact, the only way to detect files and directories that appear in one tree but not the other is to detect differences in each level’s directory.
In other words, a tree comparison algorithm will also have to perform directory comparisons along the way. Because this is a nested, and simpler operation, let’s start by coding a single-directory comparison of filenames in Example 5-24.
Example 5-24. PP2ESystemFiletoolsdirdiff.py
#!/bin/env python ######################################################## # use: python dirdiff.py dir1-path dir2-path # compare two directories to find files that exist # in one but not the other; this version uses the # os.listdir function and list difference; note # that this script only checks filename, not file # contents--see diffall.py for an extension that # does the latter by comparing .read( ) results; ######################################################## import os, sys def reportdiffs(unique1, unique2, dir1, dir2): if not (unique1 or unique2): print 'Directory lists are identical' else: if unique1: print 'Files unique to', dir1 for file in unique1: print '...', file if unique2: print 'Files unique to', dir2 for file in unique2: print '...', file def unique(seq1, seq2): uniques = [] # return items in seq1 only for item in seq1: if item not in seq2: uniques.append(item) return uniques def comparedirs(dir1, dir2): print 'Comparing', dir1, 'to', dir2 files1 = os.listdir(dir1) files2 = os.listdir(dir2) unique1 = unique(files1, files2) unique2 = unique(files2, files1) reportdiffs(unique1, unique2, dir1, dir2) return not (unique1 or unique2) # true if no diffs def getargs( ): try: dir1, dir2 = sys.argv[1:] # 2 command-line args except: print 'Usage: dirdiff.py dir1 dir2' sys.exit(1) else: return (dir1, dir2) if __name__ == '__main__': dir1, dir2 = getargs( ) comparedirs(dir1, dir2)
Given listings of names in two directories, this script simply picks
out unique names in the first, unique names in the second, and
reports any unique names found as differences (that is, files in one
directory but not the other). Its comparedirs
function returns a true result if no differences were
found -- useful for detecting differences in callers.
Let’s run this script on a few directories; differences are detected and reported as names unique in either passed-in directory pathname. Notice that this is only a structural comparison that just checks names in listings, not file contents (we’ll add the latter in a moment):
C: emp>python %X%systemfiletoolsdirdiff.py examples cpexamples
Comparing examples to cpexamples Directory lists are identical C: emp>python %X%systemfiletoolsdirdiff.py
examplesPyTools cpexamplesPyTools
Comparing examplesPyTools to cpexamplesPyTools Files unique to examplesPyTools ... visitor.py C: emp>python %X%systemfiletoolsdirdiff.py
examplesSystemFiletools
cpexamplesSystemFiletools
Comparing examplesSystemFiletools to cpexamplesSystemFiletools Files unique to examplesSystemFiletools ... dirdiff2.py Files unique to cpexamplesSystemFiletools ... cpall.py
The unique
function is the heart of this script:
it performs a simple list difference operation. Here’s how it
works apart from the rest of this script’s code:
>>>L1 = [1, 3, 5, 7, 9]
>>>L2 = [2, 3, 6, 8, 9]
>>>from dirdiff import unique
>>>unique(L1, L2)
# items in L1 but not in L2 [1, 5, 7] >>>unique(L2, L1)
# items in L2 but not in L1 [2, 6, 8]
These two lists have objects 3 and 9 in common; the rest appear only
in one of the two. When applied to directories,
unique items represent tree differences, and
common items are names of files or
subdirectories that merit further comparisons or traversal. There are
other ways to check this code; see the dirdiff
variants in the book’s CD for a few.
Now all we need is a tree walker that applies
dirdiff
at each level to pick out unique files and
directories, explicitly compares the contents of files in common, and
descends through directories in common. Example 5-25
fits the bill.
Example 5-25. PP2ESystemFiletoolsdiffall.py
######################################################### # Usage: "python diffall.py dir1 dir2". # recursive tree comparison--report files that exist # in only dir1 or dir2, report files of same name in # dir1 and dir2 with differing contents, and do the # same for all subdirectories of the same names in # and below dir1 and dir2; summary of diffs appears # at end of output (but search redirected output for # "DIFF" and "unique" strings for further details); ######################################################### import os, dirdiff def intersect(seq1, seq2): commons = [] # items in seq1 and seq2 for item in seq1: if item in seq2: commons.append(item) return commons def comparedirs(dir1, dir2, diffs, verbose=0): # compare filename lists print '-'*20 if not dirdiff.comparedirs(dir1, dir2): diffs.append('unique files at %s - %s' % (dir1, dir2)) print 'Comparing contents' files1 = os.listdir(dir1) files2 = os.listdir(dir2) common = intersect(files1, files2) # compare contents of files in common for file in common: path1 = os.path.join(dir1, file) path2 = os.path.join(dir2, file) if os.path.isfile(path1) and os.path.isfile(path2): bytes1 = open(path1, 'rb').read( ) bytes2 = open(path2, 'rb').read( ) if bytes1 == bytes2: if verbose: print file, 'matches' else: diffs.append('files differ at %s - %s' % (path1, path2)) print file, 'DIFFERS' # recur to compare directories in common for file in common: path1 = os.path.join(dir1, file) path2 = os.path.join(dir2, file) if os.path.isdir(path1) and os.path.isdir(path2): comparedirs(path1, path2, diffs, verbose) if __name__ == '__main__': dir1, dir2 = dirdiff.getargs( ) mydiffs = [] comparedirs(dir1, dir2, mydiffs) # changes mydiffs in-place print '='*40 # walk, report diffs list if not mydiffs: print 'No diffs found.' else: print 'Diffs found:', len(mydiffs) for diff in mydiffs: print '-', diff
At each directory in the tree, this script simply runs the
dirdiff
tool to detect unique names, and then
compares names in common by intersecting directory lists. Since
we’ve already studied the tree-walking tools this script
employs, let’s jump right into a few example runs. When run on
identical trees, status messages scroll during the traversal, and a
“No diffs found” message appears at the end:
C: emp>python %X%systemfiletoolsdiffall.py examples cpexamples
--------------------
Comparing examples to cpexamples
Directory lists are identical
Comparing contents
--------------------
Comparing examplesold_Part2 to cpexamplesold_Part2
Directory lists are identical
Comparing contents
--------------------
...more lines deleted...
--------------------
Comparing examplesEmbExtRegist to cpexamplesEmbExtRegist
Directory lists are identical
Comparing contents
--------------------
Comparing examplesPyTools to cpexamplesPyTools
Directory lists are identical
Comparing contents
========================================
No diffs found.
To show how differences are reported, we need to generate a few. Let’s run the global search-and-replace script we met earlier, to change a few files scattered about one of the trees (see -- I told you global replacement could trash your files!):
C: empexamples>python %X%PyToolsvisitor_replace.py -exec SPAM
Are you sure?y
... 1355 => .PyToolsvisitor_find_quiet1.py 1356 => .PyToolsfixeoln_one.doc.txt Visited 1356 files Changed 8 files: .package.csh .README-PP2E.txt . eadme-old-pp1E.txt . emp . emp .InternetCgi-Webfixcgi.py .SystemProcessesoutput.txt .PyToolscleanpyc.py
While we’re at it, let’s remove a few common files so directory uniqueness differences show up on the scope too; the following three removal commands will make two directories differ (the last two commands impact the same directory in different trees):
C: emp>rm cpexamplesPyToolsvisitor.py
C: emp>rm cpexamplesSystemFiletoolsdirdiff2.py
C: emp>rm examplesSystemFiletoolscpall.py
Now, rerun the comparison walker to pick out differences, and pipe its output report to a file for easy inspection. The following lists just the parts of the output report that identify differences. In typical use, I inspect the summary at the bottom of the report first, and then search for strings “DIFF” and “unique” in the report’s text if I need more information about the differences summarized:
C: emp>python %X%systemfiletoolsdiffall.py examples cpexamples > diffs
C: emp>type diffs
-------------------- Comparing examples to cpexamples Directory lists are identical Comparing contents package.csh DIFFERS README-PP2E.txt DIFFERS readme-old-pp1E.txt DIFFERS temp DIFFERS remp DIFFERS -------------------- Comparing examplesold_Part2 to cpexamplesold_Part2 Directory lists are identical Comparing contents -------------------- ... -------------------- Comparing examplesInternetCgi-Web to cpexamplesInternetCgi-Web Directory lists are identical Comparing contents fixcgi.py DIFFERS -------------------- ... -------------------- Comparing examplesSystemFiletools to cpexamplesSystemFiletools Files unique to examplesSystemFiletools ... dirdiff2.py Files unique to cpexamplesSystemFiletools ... cpall.py Comparing contents -------------------- ... -------------------- Comparing examplesSystemProcesses to cpexamplesSystemProcesses Directory lists are identical Comparing contents output.txt DIFFERS -------------------- ... -------------------- Comparing examplesPyTools to cpexamplesPyTools Files unique to examplesPyTools ... visitor.py Comparing contents cleanpyc.py DIFFERS ======================================== Diffs found: 10 - files differ at examplespackage.csh - cpexamplespackage.csh - files differ at examplesREADME-PP2E.txt - cpexamplesREADME-PP2E.txt - files differ at examples eadme-old-pp1E.txt - cpexamples eadme-old-pp1E.txt - files differ at examples emp - cpexamples emp - files differ at examples emp - cpexamples emp - files differ at examplesInternetCgi-Webfixcgi.py - cpexamplesInternetCgi-Webfixcgi.py - unique files at examplesSystemFiletools - cpexamplesSystemFiletools - files differ at examplesSystemProcessesoutput.txt - cpexamplesSystemProcessesoutput.txt - unique files at examplesPyTools - cpexamplesPyTools - files differ at examplesPyToolscleanpyc.py - cpexamplesPyToolscleanpyc.py
I added line breaks and tabs in a few of these output lines to make them fit on this page, but the report is simple to understand. Ten differences were found -- the eight files we changed (trashed) with the replacement script, and the two directories we threw out of sync with the three rm remove commands.
So how does this script placate CD backup paranoia? To double-check
my CD writer’s work, I run a command like the following. I can
also use a command like this to find out what has been changed since
the last backup. Again, since the CD is “G:” on my
machine when plugged in, I provide a path rooted there; use a root
such as /dev/cdrom
on Linux:
C: emp>python %X%systemfiletoolsdiffall.py
examples g:PP2ndEdexamplesPP2E > exdiffs091500
C: emp>more exdiffs091500
-------------------- Comparing examples to g:PP2ndEdexamplesPP2E Files unique to examples ... .cshrc Comparing contents tounix.py DIFFERS -------------------- Comparing examplesold_Part2 to g:PP2ndEdexamplesPP2Eold_Part2 Directory lists are identical Comparing contents -------------------- ...more visitor_fixeoln.py DIFFERS visitor_fixnames.py DIFFERS ======================================== Diffs found: 41 - unique files at examples - g:PP2ndEdexamplesPP2E - files differ at examples ounix.py - g:PP2ndEdexamplesPP2E ounix.py ...more
The CD spins, the script compares, and a summary of 41 differences
appears at the end of the report (in this case, representing changes
to the examples tree since the latest backup CD was burned). For an
example of a full difference report, see file
exdiffs091500
on the book’s CD. More
typically, this is what turns up for most of my example
backups -- files with a leading “.” are not copied to
the CD:
C: emp>python %X%SystemFiletoolsdiffall.py
examples g:PP2ndEdexamplesPP2E
... -------------------- Comparing examplesConfig to g:PP2ndEdexamplesPP2EConfig Files unique to examplesConfig ... .cshrc Comparing contents ======================================== Diffs found: 1 - unique files at examplesConfig - g:PP2ndEdexamplesPP2EConfig
And to really be sure, I run the following global comparison command against the true book directory, to verify the entire book tree backup on CD:
C:>python %X%SystemFiletoolsdiffall.py PP2ndEd G:PP2ndEd
--------------------
Comparing PP2ndEd to G:PP2ndEd
Files unique to G:PP2ndEd
... examples.tar.gz
Comparing contents
README.txt DIFFERS
--------------------
...more
--------------------
Comparing PP2ndEdexamplesPP2EConfig to G:PP2ndEdexamplesPP2EConfig
Files unique to PP2ndEdexamplesPP2EConfig
... .cshrc
Comparing contents
--------------------
...more
--------------------
Comparing PP2ndEdchapters to G:PP2ndEdchapters
Directory lists are identical
Comparing contents
ch01-intro.doc DIFFERS
ch04-os3.doc DIFFERS
ch05-gui1.doc DIFFERS
ch06-gui2.doc DIFFERS
--------------------
...more
========================================
Diffs found: 11
- unique files at PP2ndEd - G:PP2ndEd
- files differ at PP2ndEdREADME.txt - G:PP2ndEdREADME.txt
...more
This particular run indicates that I’ve changed a
“readme” file, four chapter files, and a bunch more since
the last backup; if run immediately after making a backup, only the
.cshrc
unique file shows up on
diffall
radar. This global comparison can take a
few minutes -- it performs byte-for-byte comparisons of all
chapter files and screen shots, the examples tree, an image of the
book’s CD, and more, but it’s an accurate and complete
verification. Given that this book tree contained roughly 119M of
data in 7300 files and 570 directories the last time I checked, a
more manual verification procedure without Python’s help would
be utterly impossible.
Finally, it’s worth noting that this script still only detects differences in the tree, but does not give any further details about individual file differences. In fact, it simply loads and compares the binary contents of corresponding files with a single string comparison -- it’s a simple yes/no result.[47] If and when I need more details about how two reported files actually differ, I either edit the files, or run the file-comparison command on the host platform (e.g., fc on Windows/DOS, diff or cmp on Unix and Linux). That’s not a portable solution for this last step; but for my purposes, just finding the differences in a 1300-file tree was much more critical than reporting which lines differ in files flagged in the report.
Of course, since we can always run shell commands in Python, this
last step could be automated by spawning a diff or
fc command with os.popen
as
differences are encountered (or after the traversal, by scanning the
report summary). Because Python excels at processing files and
strings, though, it’s possible to go one step further and code
a Python equivalent of the fc and
diff commands. Since this is beyond both this
script’s scope and this chapter’s size limits, that will
have to await the attention of a curious reader.
[47] We might try to do a bit better here,
by opening text files in text mode to ignore line-terminator
differences, but it’s not clear that such differences should be
blindly ignored (what if the caller wants to know if line-end markers
have been changed?). We could also be smarter by avoiding the load
and compare steps for files that differ in size, and read files in
small chunks, instead of all at once, to minimize memory requirements
for huge files (see earlier examples such as the
cpall
script for hints). For my comparisons, such
optimizations are unnecessary.
3.145.156.122