When I wrote the first edition of this book, I shipped two copies of every example file on the CD-ROM—one with Unix line-end markers and one with DOS markers. The idea was that this would make it easy to view and edit the files on either platform. Readers would simply copy the examples directory tree designed for their platform onto their hard drive and ignore the other one.
If you read Chapter 4, you know the issue here: DOS (and by proxy, Windows) marks line ends in text files with the two characters \r\n (carriage return, line feed), but Unix uses just a single \n. Most modern text editors don’t care; they happily display text files encoded in either format. Some tools are less forgiving, though. I still occasionally see the odd \r character when viewing DOS files on Unix, or an entire file in a single line when looking at Unix files on DOS (the Notepad accessory does this on Windows, for example).
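To see which convention a given file uses, it can help to count the markers in its raw bytes. The following is a minimal Python 3 sketch (the listings in this section are Python 2); the function name count_line_ends is an illustrative assumption, not part of the book's examples.

```python
def count_line_ends(data: bytes) -> dict:
    # Count DOS (CR LF) markers, lone LF markers, and stray CR bytes.
    dos = data.count(b'\r\n')
    unix = data.count(b'\n') - dos          # line feeds not preceded by \r
    stray_cr = data.count(b'\r') - dos      # carriage returns not followed by \n
    return {'dos': dos, 'unix': unix, 'stray_cr': stray_cr}

print(count_line_ends(b'a\r\nb\r\nc\r\n'))    # DOS-format text
print(count_line_ends(b'a\nb\nc\n'))          # Unix-format text
```

A file read with open(name, 'rb').read() can be passed straight to this function, since binary mode returns the bytes exactly as stored.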
Because this is only an occasional annoyance, and because it’s easy to forget to keep two distinct example trees in sync, I adopted a different policy as of the book’s second edition: we’re shipping a single copy of the examples (in DOS format), along with a portable converter tool for changing to and from other line-end formats.
The main obstacle, of course, is how to go about providing a portable and easy-to-use converter, one that runs “out of the box” on almost every computer, without changes or recompiles. Some Unix platforms have commands such as fromdos and dos2unix, but they are not universally available even on Unix. DOS batch files and csh scripts could do the job on Windows and Unix, respectively, but neither solution works on both platforms.
Fortunately, Python does. The scripts presented in Examples 7-1, 7-3, and 7-4 convert end-of-line markers between DOS and Unix formats; they convert a single file, a directory of files, and a directory tree of files. In this section, we briefly look at each script and contrast some of the system tools they apply. Each reuses the prior script’s code and becomes progressively more powerful in the process.
The last of these three scripts, Example 7-4, is the portable converter tool I was looking for; it converts line ends in the entire examples tree, in a single step. Because it is pure Python, it also works on both DOS and Unix unchanged; as long as Python is installed, it is the only line converter you may ever need to remember.
These three scripts were developed in stages on purpose, so that I could focus on getting line-feed conversions right before worrying about directories and tree walking logic. With that scheme in mind, Example 7-1 addresses just the task of converting lines in a single text file.
Example 7-1. PP3E\PyTools\fixeoln_one.py
##############################################################################
# Use: "python fixeoln_one.py [tounix|todos] filename".
# Convert end-of-lines in the single text file whose name is passed in on the
# command line, to the target format (tounix or todos).  The _one, _dir, and
# _all converters reuse the convert function here.  convertEndlines changes
# end-lines only if necessary: lines that are already in the target format
# are left unchanged, so it's OK to convert a file > once with any of the
# 3 fixeoln scripts.  Note: must use binary file open modes for this to work
# on Windows, else default text mode automatically deletes the \r on reads,
# and adds extra \r for each \n on writes; see PyTools\dumpfile.py raw bytes;
##############################################################################

import os
listonly = False                   # True=show file to be changed, don't rewrite

def convertEndlines(format, fname):                      # convert one file
    if not os.path.isfile(fname):                        # todos:  \n   => \r\n
        print 'Not a text file', fname                   # tounix: \r\n => \n
        return                                           # skip directory names

    newlines = []
    changed  = 0
    for line in open(fname, 'rb').readlines( ):          # use binary i/o modes
        if format == 'todos':                            # else \r lost on Win
            if line[-1:] == '\n' and line[-2:-1] != '\r':
                line = line[:-1] + '\r\n'
                changed = 1
        elif format == 'tounix':                         # avoids IndexError
            if line[-2:] == '\r\n':                      # slices are scaled
                line = line[:-2] + '\n'
                changed = 1
        newlines.append(line)

    if changed:
        try:                                             # might be read-only
            print 'Changing', fname
            if not listonly: open(fname, 'wb').writelines(newlines)
        except IOError, why:
            print 'Error writing to file %s: skipped (%s)' % (fname, why)

if __name__ == '__main__':
    import sys
    errmsg = 'Required arguments missing: ["todos"|"tounix"] filename'
    assert (len(sys.argv) == 3 and sys.argv[1] in ['todos', 'tounix']), errmsg
    convertEndlines(sys.argv[1], sys.argv[2])
    print 'Converted', sys.argv[2]
This script is fairly straightforward as system utilities go;
it relies primarily on the built-in file object’s methods. Given a
target format flag and filename, it loads the file into a lines list
using the readlines
method,
converts input lines to the target format if needed, and writes the
result back to the file with the writelines
method if any lines were
changed:
C:\temp\examples>python %X%\PyTools\fixeoln_one.py tounix PyDemos.pyw
Changing PyDemos.pyw
Converted PyDemos.pyw

C:\temp\examples>python %X%\PyTools\fixeoln_one.py todos PyDemos.pyw
Changing PyDemos.pyw
Converted PyDemos.pyw

C:\temp\examples>fc PyDemos.pyw %X%\PyDemos.pyw
Comparing files PyDemos.pyw and C:\PP3rdEd\examples\PP3E\PyDemos.pyw
FC: no differences encountered

C:\temp\examples>python %X%\PyTools\fixeoln_one.py todos PyDemos.pyw
Converted PyDemos.pyw

C:\temp\examples>python %X%\PyTools\fixeoln_one.py toother nonesuch.txt
Traceback (innermost last):
  File "C:\PP3rdEd\examples\PP3E\PyTools\fixeoln_one.py", line 45, in ?
    assert (len(sys.argv) == 3 and sys.argv[1] in ['todos', 'tounix']), errmsg
AssertionError: Required arguments missing: ["todos"|"tounix"] filename
Here, the first command converts the file to Unix line-end format (tounix), and the second and fourth convert to the DOS convention, all regardless of the platform on which this script is run. To make typical usage easier, converted text is written back to the file in place, instead of to a newly created output file. Notice that this script’s filename has an _ (underscore) in it, not a - (hyphen); because it is meant to be both run as a script and imported as a library, its filename must translate to a legal Python variable name in importers (fixeoln-one.py won’t work for both roles).
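For readers on Python 3, the same per-line logic can be sketched with bytes objects, which is what binary-mode files return there. This is only an approximation of Example 7-1 under that assumption; the names convert_lines and convert_file are illustrative and do not appear in the examples distribution.

```python
def convert_lines(fmt, lines):
    """Return (new_lines, changed) for a list of bytes lines.

    fmt is 'todos' or 'tounix'; lines already in the target format
    are passed through untouched, so the conversion is repeatable.
    """
    new_lines, changed = [], False
    for line in lines:
        if fmt == 'todos':
            if line[-1:] == b'\n' and line[-2:-1] != b'\r':
                line, changed = line[:-1] + b'\r\n', True
        elif fmt == 'tounix':
            if line[-2:] == b'\r\n':
                line, changed = line[:-2] + b'\n', True
        new_lines.append(line)
    return new_lines, changed

def convert_file(fmt, fname):
    """Convert fname in place using binary I/O; return True if rewritten."""
    with open(fname, 'rb') as f:
        lines = f.readlines()
    new_lines, changed = convert_lines(fmt, lines)
    if changed:
        with open(fname, 'wb') as f:
            f.writelines(new_lines)
    return changed
```

Splitting the in-memory conversion from the file handling is a design choice of this sketch; it makes the line logic easy to try at the interactive prompt without touching any files.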
In all the examples in this chapter that change files in directory trees, the C:\temp\examples and C:\temp\cpexamples directories used in testing are full copies of the real PP3E examples root directory. I don’t always show the copy commands used to create these test directories along the way (at least not until we’ve written our own in Python).
The fc
DOS file-compare
command in the preceding interaction confirms the conversions, but
to better verify the results of this Python script, I wrote
another, shown in Example
7-2.
Example 7-2. PP3E\PyTools\dumpfile.py
import sys
bytes = open(sys.argv[1], 'rb').read( )
print '-'*40
print repr(bytes)

print '-'*40
while bytes:
    bytes, chunk = bytes[4:], bytes[:4]          # show four bytes per line
    for c in chunk: print oct(ord(c)), '\t',     # show octal of binary value
    print

print '-'*40
for line in open(sys.argv[1], 'rb').readlines( ):
    print repr(line)
To give a clear picture of a file’s contents, this script opens a file in binary mode (to suppress automatic line-feed conversions), prints its raw contents (bytes) all at once, displays the octal numeric ASCII codes of its contents four bytes per line, and shows its raw lines. Let’s use this to trace conversions. First of all, use a simple text file to make wading through bytes a bit more humane:
C:\temp>type test.txt
a
b
c

C:\temp>python %X%\PyTools\dumpfile.py test.txt
----------------------------------------
'a\r\nb\r\nc\r\n'
----------------------------------------
0141    015     012     0142
015     012     0143    015
012
----------------------------------------
'a\r\n'
'b\r\n'
'c\r\n'
The test.txt file here is in DOS line-end format; the escape sequence \r\n is simply the DOS line-end marker. Now, converting to Unix format changes all the DOS \r\n markers to a single \n as advertised:
C:\temp>python %X%\PyTools\fixeoln_one.py tounix test.txt
Changing test.txt
Converted test.txt

C:\temp>python %X%\PyTools\dumpfile.py test.txt
----------------------------------------
'a\nb\nc\n'
----------------------------------------
0141    012     0142    012
0143    012
----------------------------------------
'a\n'
'b\n'
'c\n'
And converting back to DOS restores the original file format:
C:\temp>python %X%\PyTools\fixeoln_one.py todos test.txt
Changing test.txt
Converted test.txt

C:\temp>python %X%\PyTools\dumpfile.py test.txt
----------------------------------------
'a\r\nb\r\nc\r\n'
----------------------------------------
0141    015     012     0142
015     012     0143    015
012
----------------------------------------
'a\r\n'
'b\r\n'
'c\r\n'

C:\temp>python %X%\PyTools\fixeoln_one.py todos test.txt     # makes no changes
Converted test.txt
Notice that no “Changing” message is emitted for the last command just run because no changes were actually made to the file (it was already in DOS format). Because this program is smart enough to avoid converting a line that is already in the target format, it is safe to rerun on a file even if you can’t recall what format the file already uses. More naïve conversion logic might be simpler, but it may not be repeatable. For instance, a replace string method call can be used to expand a Unix \n to a DOS \r\n, but only once:
>>> lines = 'aaa\nbbb\nccc\n'
>>> lines = lines.replace('\n', '\r\n')        # OK: \r added
>>> lines
'aaa\r\nbbb\r\nccc\r\n'
>>> lines = lines.replace('\n', '\r\n')        # bad: double \r
>>> lines
'aaa\r\r\nbbb\r\r\nccc\r\r\n'
Such logic could easily trash a file if applied to it twice.[*] To really understand how the script gets around this problem, though, we need to take a closer look at its use of slices and binary file modes.
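For contrast, a replace-based conversion can be made repeatable by collapsing existing DOS markers before expanding. This is not the approach the fixeoln scripts take; it is a separate Python 3 sketch, and the helper names to_dos and to_unix are assumptions made for illustration.

```python
def to_dos(data: bytes) -> bytes:
    # Collapse any existing \r\n to \n first, then expand every \n exactly once.
    return data.replace(b'\r\n', b'\n').replace(b'\n', b'\r\n')

def to_unix(data: bytes) -> bytes:
    # Dropping the \r from each \r\n pair is naturally repeatable.
    return data.replace(b'\r\n', b'\n')

once = to_dos(b'aaa\nbbb\nccc\n')
twice = to_dos(once)
print(repr(once))
print(once == twice)      # a second pass leaves the data unchanged
```

Unlike the line-by-line script, this version always rewrites the whole string, so it cannot report whether anything actually changed without an extra comparison.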
This script relies on subtle aspects of string slicing behavior to inspect parts of each line without size checks. For instance:
The expression line[-2:]
returns the last two
characters at the end of the line (or one or zero characters,
if the line isn’t at least two characters long).
A slice such as line[-2:-1]
returns the
second-to-last character (or an empty string if the line is
too small to have a second-to-last character).
The operation line[:-2]
returns all characters
except the last two at the end (or an empty string if there
are fewer than three characters).
Because out-of-bounds slices scale slice limits to be inbounds, the script doesn’t need to add explicit tests to guarantee that the line is big enough to have end-line characters at the end. For example:
>>> 'aaaXY'[-2:], 'XY'[-2:], 'Y'[-2:], ''[-2:]
('XY', 'XY', 'Y', '')

>>> 'aaaXY'[-2:-1], 'XY'[-2:-1], 'Y'[-2:-1], ''[-2:-1]
('X', 'X', '', '')

>>> 'aaaXY'[:-2], 'aaaY'[:-1], 'XY'[:-2], 'Y'[:-1]
('aaa', 'aaa', '', '')
If you imagine characters such as \r and \n rather than the X and Y here, you’ll understand how the script exploits slice scaling to good effect.
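The same tests can be tried with real end-of-line characters. The sketch below is Python 3 and uses bytes, as binary-mode reads return there; the two helper names are hypothetical and simply wrap the slice expressions used in Example 7-1.

```python
def ends_with_dos_marker(line: bytes) -> bool:
    return line[-2:] == b'\r\n'          # out-of-range slices shrink silently

def ends_with_bare_newline(line: bytes) -> bool:
    return line[-1:] == b'\n' and line[-2:-1] != b'\r'

for sample in (b'aaa\r\n', b'\r\n', b'aaa\n', b'\n', b''):
    print(repr(sample), ends_with_dos_marker(sample), ends_with_bare_newline(sample))

try:
    b''[-1]                              # indexing, unlike slicing, does raise
except IndexError as exc:
    print('IndexError:', exc)
```

The final try statement shows why the script slices rather than indexes: an empty or one-character line would otherwise need an explicit length check.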
Because this script aims to be portable to Windows, it also takes care to open files in binary mode, even though they contain text data. As we’ve seen, when files are opened in text mode on Windows, \r is stripped from \r\n markers on input, and \r is added before \n markers on output. This automatic conversion allows scripts to represent the end-of-line marker as \n on all platforms. Here, though, it would also mean that the script would never see the \r it’s looking for to detect a DOS-encoded line because the \r would be dropped before it ever reached the script:
>>> open('temp.txt', 'w').writelines(['aaa\n', 'bbb\n'])
>>> open('temp.txt', 'rb').read( )
'aaa\r\nbbb\r\n'
>>> open('temp.txt', 'r').read( )
'aaa\nbbb\n'
Without binary open mode, this can lead to fairly subtle and incorrect behavior on Windows. For example, if files are opened in text mode, converting in todos mode on Windows would actually produce double \r characters: the script might convert the stripped \n to \r\n, which is then expanded on output to \r\r\n!
>>> open('temp.txt', 'w').writelines(['aaa\r\n', 'bbb\r\n'])
>>> open('temp.txt', 'rb').read( )
'aaa\r\r\nbbb\r\r\n'
With binary mode, the script inputs a full \r\n, so no conversion is performed. Binary mode is also required for output on Windows in order to suppress the insertion of \r characters; without it, the tounix conversion would fail on that platform.[*]
If all that is too subtle to bear, just remember to use the
b
in file open mode strings if
your scripts might be run on Windows, and that you mean to process
either true binary data or text data as it is actually stored in
the file.
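The double-\r effect can be reproduced on any platform in Python 3, where the built-in open accepts a newline argument that controls text-mode translation. The sketch below passes newline='\r\n' to imitate what Windows text mode does by default; that substitution is an assumption made so the demonstration does not depend on the host system, and the scratch file location is arbitrary.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

# newline='\r\n' expands each '\n' written, as Windows text mode would.
with open(path, 'w', newline='\r\n') as f:
    f.writelines(['aaa\r\n', 'bbb\r\n'])      # lines already in DOS format

with open(path, 'rb') as f:
    raw = f.read()                            # bytes exactly as stored

with open(path, 'r') as f:
    text = f.read()                           # default text mode translates line ends

print(repr(raw))
print(repr(text))
```

In Python 3, passing newline='' to open in text mode turns the translation off in both directions, which is an alternative to binary mode when the data must still be decoded to str.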
Armed with a fully debugged single file converter, it’s an
easy step to add support for converting all files in a single
directory. Simply call the single file converter on every filename
returned by a directory listing tool. The script in Example 7-3 uses the glob
module we met in Chapter 4 to grab a list of files to
convert.
Example 7-3. PP3E\PyTools\fixeoln_dir.py
##########################################################################
# Use: "python fixeoln_dir.py [tounix|todos] patterns?".
# convert end-lines in all the text files in the current directory
# (only: does not recurse to subdirectories). Reuses converter in the
# single-file version, fixeoln_one.
##########################################################################

import sys, glob
from fixeoln_one import convertEndlines
listonly = 0
patts = ['*.py', '*.pyw', '*.txt', '*.cgi', '*.html',    # text filenames
         '*.c',  '*.cxx', '*.h',   '*.i',   '*.out',     # in this package
         'README*', 'makefile*', 'output*', '*.note']

if __name__ == '__main__':
    errmsg = 'Required first argument missing: "todos" or "tounix"'
    assert (len(sys.argv) >= 2 and sys.argv[1] in ['todos', 'tounix']), errmsg

    if len(sys.argv) > 2:                 # glob anyhow: '*' not applied on DOS
        patts = sys.argv[2:]              # though not really needed on Linux
    filelists = map(glob.glob, patts)     # name matches in this dir only

    count = 0
    for list in filelists:
        for fname in list:
            if listonly:
                print count+1, '=>', fname
            else:
                convertEndlines(sys.argv[1], fname)
            count += 1
    print 'Visited %d files' % count
This module defines a list, patts
, containing filename patterns that
match all the kinds of text files that appear in the book examples
tree; each pattern is passed to the built-in glob.glob
call by map
to be separately expanded into a list
of matching files. That’s why there are nested for
loops near the end. The outer loop
steps through each glob
result
list, and the inner steps through each name within each list. Try
the map
call interactively if
this doesn’t make sense:
>>> import glob
>>> map(glob.glob, ['*.py', '*.html'])
[['helloshell.py'], ['about-pp.html', 'about-pp2e.html', 'about-ppr2e.html']]
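In Python 3, map returns an iterator rather than a list, so the interaction above would print a map object instead of the nested lists. A rough Python 3 equivalent is sketched below; the scratch directory and its filenames are made up for the demonstration, and itertools.chain is shown only as one optional way to avoid the nested loops.

```python
import glob
import itertools
import os
import tempfile

demo_dir = tempfile.mkdtemp()                       # scratch directory for the demo
for name in ('a.py', 'b.py', 'index.html', 'notes.doc'):
    open(os.path.join(demo_dir, name), 'w').close()

patts = ['*.py', '*.html']
# One result list per pattern, as map(glob.glob, patts) built in Python 2.
filelists = [sorted(glob.glob(os.path.join(demo_dir, p))) for p in patts]
print([[os.path.basename(f) for f in names] for names in filelists])

# Flatten the per-pattern lists when a single loop is preferred.
for fname in itertools.chain.from_iterable(filelists):
    print('would convert', os.path.basename(fname))
```

The results are sorted here because glob.glob makes no promise about ordering.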
This script requires a convert mode flag on the command line
and assumes that it is run in the directory where files to be
converted live; cd
to the
directory to be converted before running this script (or change it
to accept a directory name argument too):
C:\temp\examples>python %X%\PyTools\fixeoln_dir.py tounix
Changing Launcher.py
Changing Launch_PyGadgets.py
Changing LaunchBrowser.py
 ...lines deleted...
Changing PyDemos.pyw
Changing PyGadgets_bar.pyw
Changing README-PP3E.txt
Visited 21 files

C:\temp\examples>python %X%\PyTools\fixeoln_dir.py todos
Changing Launcher.py
Changing Launch_PyGadgets.py
Changing LaunchBrowser.py
 ...lines deleted...
Changing PyDemos.pyw
Changing PyGadgets_bar.pyw
Changing README-PP3E.txt
Visited 21 files

C:\temp\examples>python %X%\PyTools\fixeoln_dir.py todos     # makes no changes
Visited 21 files

C:\temp\examples>fc PyDemos.pyw %X%\PyDemos.pyw
Comparing files PyDemos.pyw and C:\PP3rdEd\examples\PP3E\PyDemos.pyw
FC: no differences encountered
Notice that the third command generated no “Changing” messages
again. Because the convertEndlines
function of the
single-file module is reused here to perform the actual updates,
this script inherits that function’s
repeatability: it’s OK to rerun this script on
the same directory any number of times. Only lines that require
conversion will be converted. This script also accepts an optional
list of filename patterns on the command line in order to override
the default patts
list of files
to be changed:
C:\temp\examples>python %X%\PyTools\fixeoln_dir.py tounix *.pyw *.csh
Changing echoEnvironment.pyw
Changing Launch_PyDemos.pyw
Changing Launch_PyGadgets_bar.pyw
Changing PyDemos.pyw
Changing PyGadgets_bar.pyw
Changing cleanall.csh
Changing makeall.csh
Changing package.csh
Changing setup-pp.csh
Changing setup-pp-embed.csh
Changing xferall.linux.csh
Visited 11 files

C:\temp\examples>python %X%\PyTools\fixeoln_dir.py tounix *.pyw *.csh
Visited 11 files
Also notice that the single-file script’s convertEndlines
function performs an initial os.path.isfile
test to make sure the
passed-in filename represents a file, not a
directory; when we start globbing with patterns to collect files to
convert, it’s not impossible that a pattern’s expansion might
include the name of a directory along with the desired files.
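A small experiment makes the point concrete. In this Python 3 sketch, a scratch directory holds both a file and a subdirectory whose names match the same pattern; the names output.txt and output_logs are invented for the illustration.

```python
import glob
import os
import tempfile

root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, 'output_logs'))               # a directory...
open(os.path.join(root, 'output.txt'), 'w').close()       # ...and a file

matches = sorted(glob.glob(os.path.join(root, 'output*')))
files_only = [m for m in matches if os.path.isfile(m)]    # the guard used by the converter

print([os.path.basename(m) for m in matches])
print([os.path.basename(m) for m in files_only])
```

Without the os.path.isfile filter, an attempt to open the directory name for reading would raise an error partway through a run.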
Unix and Linux users: Unix-like shells automatically glob (i.e., expand) filename pattern operators like * in command lines before they ever reach your script. You generally need to quote such patterns to pass them in to scripts verbatim (e.g., "*.py"). The fixeoln_dir script will still work if you don’t. Its glob.glob calls will simply find a single matching filename for each already globbed name, and so have no effect:
>>> glob.glob('PyDemos.pyw')
['PyDemos.pyw']
Patterns are not preglobbed in the DOS shell, though, so the
glob.glob
calls here are still
a good idea in scripts that aspire to be as portable as this
one.
Finally, Example 7-4 applies what we’ve already learned to an entire directory tree. It simply runs the file-converter function on every filename produced by tree-walking logic. In fact, this script really just orchestrates calls to the original and already debugged convertEndlines function.
Example 7-4. PP3E\PyTools\fixeoln_all.py
##############################################################################
# Use: "python fixeoln_all.py [tounix|todos] patterns?".
# find and convert end-of-lines in all text files at and below the directory
# where this script is run (the dir you are in when you type the command).
# If needed, tries to use the Python find.py library module, else reads the
# output of a Unix-style find command; uses a default filename patterns list
# if patterns argument is absent.  This script only changes files that need
# to be changed, so it's safe to run brute force from a root-level dir.
##############################################################################

import os, sys
debug    = False
pyfind   = False      # force py find
listonly = False      # True=show find results only

def findFiles(patts, debug=debug, pyfind=pyfind):
    try:
        if sys.platform[:3] == 'win' or pyfind:
            print 'Using Python find'
            try:
                import find                        # use python-code find.py
            except ImportError:                    # use mine if deprecated!
                from PP3E.PyTools import find      # may get from my dir anyhow
            matches = map(find.find, patts)        # startdir default = '.'
        else:
            print 'Using find executable'
            matches = []
            for patt in patts:
                findcmd = 'find . -name "%s" -print' % patt  # run find command
                lines = os.popen(findcmd).readlines( )       # remove endlines
                matches.append(map(str.strip, lines))        # lambda x: x[:-1]
    except:
        assert 0, 'Sorry - cannot find files'
    if debug: print matches
    return matches

if __name__ == '__main__':
    from fixeoln_dir import patts
    from fixeoln_one import convertEndlines

    errmsg = 'Required first argument missing: "todos" or "tounix"'
    assert (len(sys.argv) >= 2 and sys.argv[1] in ['todos', 'tounix']), errmsg

    if len(sys.argv) > 2:                  # quote in Unix shell
        patts = sys.argv[2:]               # else tries to expand
    matches = findFiles(patts)

    count = 0
    for matchlist in matches:                 # a list of lists
        for fname in matchlist:               # one per pattern
            if listonly:
                print count+1, '=>', fname
            else:
                convertEndlines(sys.argv[1], fname)
            count += 1
    print 'Visited %d files' % count
On Windows, the script uses the portable find.find tool we built in Chapter 4 (the hand-rolled equivalent of Python’s original find module)[*] to generate a list of all matching file and directory names in the tree; on other platforms, it resorts to spawning a less portable and perhaps slower find shell command just for illustration purposes.
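Where neither the find module nor a find executable is available, the standard library's os.walk and fnmatch can stand in. The function below is a Python 3 sketch of that idea; find_files is a hypothetical name, not the book's find.find, and its result ordering is simply sorted.

```python
import fnmatch
import os

def find_files(pattern, startdir='.'):
    """Return sorted paths at and below startdir whose names match pattern."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(startdir):
        for name in dirnames + filenames:          # directories can match too
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

A list of lists like the one Example 7-4 iterates over could then be built with [find_files(p) for p in patts]. Directory names are deliberately included in the results, mirroring the description above, so a file-only test is still needed before converting.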
Once the file pathname lists are compiled, this script simply
converts each found file in turn using the single-file converter
module’s tools. Here is the collection of scripts at work converting
the book examples tree on Windows; notice that this script also
processes the current working directory (CWD;
cd
to the directory to be
converted before typing the command line), and that Python treats
forward and backward slashes the same way in the program
filename:
C:\temp\examples>python %X%/PyTools/fixeoln_all.py tounix
Using Python find
Changing .\LaunchBrowser.py
Changing .\Launch_PyGadgets.py
Changing .\Launcher.py
Changing .\Other\cgimail.py
 ...lots of lines deleted...
Changing .\EmbExt\Exports\ClassAndMod\output.prog1
Changing .\EmbExt\Exports\output.prog1
Changing .\EmbExt\Regist\output
Visited 1051 files

C:\temp\examples>python %X%/PyTools/fixeoln_all.py todos
Using Python find
Changing .\LaunchBrowser.py
Changing .\Launch_PyGadgets.py
Changing .\Launcher.py
Changing .\Other\cgimail.py
 ...lots of lines deleted...
Changing .\EmbExt\Exports\ClassAndMod\output.prog1
Changing .\EmbExt\Exports\output.prog1
Changing .\EmbExt\Regist\output
Visited 1051 files

C:\temp\examples>python %X%/PyTools/fixeoln_all.py todos
Using Python find
Not a text file .\Embed\Inventory\Output
Not a text file .\Embed\Inventory\WithDbase\Output
Visited 1051 files
This script and its ancestors are shipped in the book’s example distribution as that portable converter tool I was looking for. To convert all example files in the tree to Unix line-terminator format, simply copy the entire PP3E examples tree to some “examples” directory on your hard drive and type these two commands in a shell:
cd examples/PP3E
python PyTools/fixeoln_all.py tounix
Of course, this assumes Python is already installed (see the
example distribution’s README file for details) but will work on
almost every platform in use today. To convert back to DOS, just
replace tounix
with todos
and rerun. I ship this tool with a
training CD for Python classes I teach too; to convert those
files, we simply type:
cd Html\Examples
python ..\..\Tools\fixeoln_all.py tounix
Once you get accustomed to the command lines, you can use
this in all sorts of contexts. Finally, to make the conversion
easier for beginners to run, the top-level examples directory
includes tounix.py and
todos.py scripts that can be simply
double-clicked in a file explorer GUI; Example 7-5 shows the tounix
converter.
Example 7-5. PP3E\tounix.py
#!/usr/local/bin/python
######################################################################
# Run me to convert all text files to Unix/Linux line-feed format.
# You only need to do this if you see odd '\r' characters at the end
# of lines in text files in this distribution, when they are viewed
# with your text editor (e.g., vi).  This script converts all files
# at and below the examples root, and only converts files that have
# not already been converted (it's OK to run this multiple times).
#
# Since this is a Python script which runs another Python script,
# you must install Python first to run this program; then from your
# system command line (e.g., a xterm window), cd to the directory
# where this script lives, and then type "python tounix.py".  You
# may also be able to simply click on this file's icon in your file
# system explorer, if it knows what '.py' files are.
######################################################################

import os
prompt = """
This program converts all text files in the book
examples distribution to UNIX line-feed format.
Are you sure you want to do this (y=yes)? """

answer = raw_input(prompt)
if answer not in ['y', 'Y', 'yes']:
    print 'Cancelled'
else:
    os.system('python PyTools/fixeoln_all.py tounix')
This script addresses the end user’s perception of usability, but other factors impact programmer usability—just as important to systems that will be read or changed by others. For example, the file, directory, and tree converters are coded in separate script files, but there is no law against combining them into a single program that relies on a command-line arguments pattern to know which of the three modes to run. The first argument could be a mode flag, tested by such a program:
if   mode == '-one':
    ...
elif mode == '-dir':
    ...
elif mode == '-all':
    ...
That seems more confusing than separate files per mode, though; it’s usually much easier to botch a complex command line than to type a specific program file’s name. It will also make for a confusing mix of global names and one very big piece of code at the bottom of the file. As always, simpler is usually better.
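To make the trade-off concrete, here is roughly what the command-line handling of such a combined program might look like in Python 3 with the standard argparse module. Everything in it is hypothetical: the program name fixeoln, the mode words (written without leading hyphens, since argparse treats hyphenated words as options), and the argument layout.

```python
import argparse

def parse_args(argv):
    parser = argparse.ArgumentParser(prog='fixeoln')      # hypothetical program name
    parser.add_argument('mode', choices=['one', 'dir', 'all'])
    parser.add_argument('format', choices=['todos', 'tounix'])
    parser.add_argument('targets', nargs='*')             # a filename or patterns
    return parser.parse_args(argv)

args = parse_args(['all', 'tounix', '*.py', '*.txt'])
print(args.mode, args.format, args.targets)
```

Even in this small form, the user has one more positional argument to remember and order correctly than with any of the three separate scripts.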
[*] In fact, see the files old_todos.py, old_tounix.py, and old_toboth.py in the PyTools directory in the examples distribution for a complete earlier implementation built around replace. It was repeatable for to-Unix changes, but not for to-DOS conversion (only the latter may add characters). The fixeoln scripts here were developed as a replacement, after I got burned by running to-DOS conversions twice.
[*] But wait, it gets worse. Because of the auto-deletion and insertion of \r characters in Windows text mode, we might simply read and write files in text mode to perform the todos line conversion when run on Windows; the file interface will automatically add the \r on output if it’s missing. However, this fails for other usage modes: tounix conversions on Windows (only binary writes can omit the \r), and todos when running on Unix (no \r is inserted). Magic is not always our friend.
[*] Recall that the home directory of a running script is always added to the front of sys.path to give the script import visibility to other files in the script’s directory. Because of that, this script would normally load the PP3E\PyTools\find.py module anyhow by just saying import find; it need not specify the full package path in the import. The try handler and full path import are useful here only if this script is moved to a different source directory. Since I move files a lot, I tend to code with self-inflicted worst-case scenarios in mind.