Fixing DOS Line Ends

When I wrote the first edition of this book, I shipped two copies of every example file on the CD-ROM—one with Unix line-end markers and one with DOS markers. The idea was that this would make it easy to view and edit the files on either platform. Readers would simply copy the examples directory tree designed for their platform onto their hard drive and ignore the other one.

If you read Chapter 4, you know the issue here: DOS (and by proxy, Windows) marks line ends in text files with the two characters \r\n (carriage return, line feed), but Unix uses just a single \n. Most modern text editors don’t care; they happily display text files encoded in either format. Some tools are less forgiving, though. I still occasionally see the odd \r character when viewing DOS files on Unix, or an entire file in a single line when looking at Unix files on DOS (the Notepad accessory does this on Windows, for example).

Because this is only an occasional annoyance, and because it’s easy to forget to keep two distinct example trees in sync, I adopted a different policy as of the book’s second edition: we’re shipping a single copy of the examples (in DOS format), along with a portable converter tool for changing to and from other line-end formats.

The main obstacle, of course, is how to go about providing a portable and easy-to-use converter—one that runs “out of the box” on almost every computer, without changes or recompiles. Some Unix platforms have commands such as fromdos and dos2unix, but they are not universally available even on Unix. DOS batch files and csh scripts could do the job on Windows and Unix, respectively, but neither solution works on both platforms.

Fortunately, Python does. The scripts presented in Examples 7-1, 7-3, and 7-4 convert end-of-line markers between DOS and Unix formats; they convert a single file, a directory of files, and a directory tree of files. In this section, we briefly look at each script and contrast some of the system tools they apply. Each reuses the prior script’s code and becomes progressively more powerful in the process.

The last of these three scripts, Example 7-4, is the portable converter tool I was looking for; it converts line ends in the entire examples tree, in a single step. Because it is pure Python, it also works on both DOS and Unix unchanged; as long as Python is installed, it is the only line converter you may ever need to remember.

Converting Line Ends in One File

These three scripts were developed in stages on purpose, so that I could focus on getting line-feed conversions right before worrying about directories and tree walking logic. With that scheme in mind, Example 7-1 addresses just the task of converting lines in a single text file.

Example 7-1. PP3E\PyTools\fixeoln_one.py

##############################################################################
# Use: "python fixeoln_one.py [tounix|todos] filename".
# Convert end-of-lines in the single text file whose name is passed in on the
# command line, to the target format (tounix or todos).  The _one, _dir, and
# _all converters reuse the convert function here.  convertEndlines changes
# end-lines only if necessary: lines that are already in the target format
# are left unchanged, so it's OK to convert a file > once with any of the
# 3 fixeoln scripts.  Note: must use binary file open modes for this to work
# on Windows, else default text mode automatically deletes the \r on reads,
# and adds extra \r for each \n on writes; see PyTools\dumpfile.py raw bytes;
##############################################################################

import os
listonly = False                      # True=show file to be changed, don't rewrite

def convertEndlines(format, fname):                      # convert one file
    if not os.path.isfile(fname):                        # todos:  \n   => \r\n
        print 'Not a text file', fname                   # tounix: \r\n => \n
        return                                           # skip directory names

    newlines = []
    changed  = 0
    for line in open(fname, 'rb').readlines():           # use binary i/o modes
        if format == 'todos':                            # else \r lost on Win
            if line[-1:] == '\n' and line[-2:-1] != '\r':
                line = line[:-1] + '\r\n'
                changed = 1
        elif format == 'tounix':                         # avoids IndexError
            if line[-2:] == '\r\n':                      # slices are scaled
                line = line[:-2] + '\n'
                changed = 1
        newlines.append(line)

    if changed:
        try:                                             # might be read-only
            print 'Changing', fname
            if not listonly: open(fname, 'wb').writelines(newlines)
        except IOError, why:
            print 'Error writing to file %s: skipped (%s)' % (fname, why)

if __name__ == '__main__':
    import sys
    errmsg = 'Required arguments missing: ["todos"|"tounix"] filename'
    assert (len(sys.argv) == 3 and sys.argv[1] in ['todos', 'tounix']), errmsg
    convertEndlines(sys.argv[1], sys.argv[2])
    print 'Converted', sys.argv[2]

This script is fairly straightforward as system utilities go; it relies primarily on the built-in file object’s methods. Given a target format flag and filename, it loads the file into a lines list using the readlines method, converts input lines to the target format if needed, and writes the result back to the file with the writelines method if any lines were changed:

C:\temp\examples>python %X%\PyTools\fixeoln_one.py tounix PyDemos.pyw
Changing PyDemos.pyw
Converted PyDemos.pyw

C:\temp\examples>python %X%\PyTools\fixeoln_one.py todos PyDemos.pyw
Changing PyDemos.pyw
Converted PyDemos.pyw

C:\temp\examples>fc PyDemos.pyw %X%\PyDemos.pyw
Comparing files PyDemos.pyw and C:\PP3rdEd\examples\PP3E\PyDemos.pyw
FC: no differences encountered

C:\temp\examples>python %X%\PyTools\fixeoln_one.py todos PyDemos.pyw
Converted PyDemos.pyw

C:\temp\examples>python %X%\PyTools\fixeoln_one.py toother nonesuch.txt
Traceback (innermost last):
  File "C:\PP3rdEd\examples\PP3E\PyTools\fixeoln_one.py", line 45, in ?
    assert (len(sys.argv) == 3 and sys.argv[1] in ['todos', 'tounix']), errmsg
AssertionError: Required arguments missing: ["todos"|"tounix"] filename

Here, the first command converts the file to Unix line-end format (tounix), and the second and fourth convert to the DOS convention—all regardless of the platform on which this script is run. To make typical usage easier, converted text is written back to the file in place, instead of to a newly created output file. Notice that this script’s filename has an _ (underscore) in it, not a - (hyphen); because it is meant to be both run as a script and imported as a library, its filename must translate to a legal Python variable name in importers (fixeoln-one.py won’t work for both roles).
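For readers on Python 3, the same line-by-line logic might be sketched as follows. This is a hedged port rather than the book's script: the name convert_endlines and its boolean return value are assumptions made for illustration, and the string literals become bytes because binary-mode files yield bytes in Python 3.

```python
import os

def convert_endlines(fmt, fname):
    """Convert line ends in one file in place; return True if it was rewritten.

    Hypothetical Python 3 port: fmt is 'todos' or 'tounix'.
    """
    if not os.path.isfile(fname):            # skip directory names
        return False
    with open(fname, 'rb') as f:             # binary mode: bytes, no translation
        lines = f.readlines()                # splits after each b'\n', keeps it
    newlines, changed = [], False
    for line in lines:
        if fmt == 'todos':
            # bare \n only: lines already ending in \r\n are left alone
            if line.endswith(b'\n') and not line.endswith(b'\r\n'):
                line = line[:-1] + b'\r\n'
                changed = True
        elif fmt == 'tounix':
            if line.endswith(b'\r\n'):
                line = line[:-2] + b'\n'
                changed = True
        newlines.append(line)
    if changed:                              # rewrite only when needed
        with open(fname, 'wb') as f:
            f.writelines(newlines)
    return changed
```

Returning a flag instead of printing lets a caller decide how to report changes; the repeatability described above still holds, because a second call on the same file finds nothing to change.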

Tip

In all the examples in this chapter that change files in directory trees, the C:\temp\examples and C:\temp\cpexamples directories used in testing are full copies of the real PP3E examples root directory. I don’t always show the copy commands used to create these test directories along the way (at least not until we’ve written our own in Python).

Slinging bytes and verifying results

The fc DOS file-compare command in the preceding interaction confirms the conversions, but to better verify the results of this Python script, I wrote another, shown in Example 7-2.

Example 7-2. PP3E\PyTools\dumpfile.py

import sys
bytes = open(sys.argv[1], 'rb').read( )
print '-'*40
print repr(bytes)

print '-'*40
while bytes:
    bytes, chunk = bytes[4:], bytes[:4]          # show four bytes per line
    for c in chunk: print oct(ord(c)), '\t',     # show octal of binary value
    print

print '-'*40
for line in open(sys.argv[1], 'rb').readlines( ):
    print repr(line)

To give a clear picture of a file’s contents, this script opens a file in binary mode (to suppress automatic line-feed conversions), prints its raw contents (bytes) all at once, displays the octal numeric ASCII codes of its contents four bytes per line, and shows its raw lines. Let’s use this to trace conversions. First of all, use a simple text file to make wading through bytes a bit more humane:

C:\temp>type test.txt
a
b
c

C:\temp>python %X%\PyTools\dumpfile.py test.txt
----------------------------------------
'a\r\nb\r\nc\r\n'
----------------------------------------
0141    015     012     0142
015     012     0143    015
012
----------------------------------------
'a\r\n'
'b\r\n'
'c\r\n'

The test.txt file here is in DOS line-end format; the escape sequence \r\n is simply the DOS line-end marker. Now, converting to Unix format changes all the DOS \r\n markers to a single \n as advertised:

C:\temp>python %X%\PyTools\fixeoln_one.py tounix test.txt
Changing test.txt
Converted test.txt

C:\temp>python %X%\PyTools\dumpfile.py test.txt
----------------------------------------
'a\nb\nc\n'
----------------------------------------
0141    012     0142    012
0143    012
----------------------------------------
'a\n'
'b\n'
'c\n'

And converting back to DOS restores the original file format:

C:\temp>python %X%\PyTools\fixeoln_one.py todos test.txt
Changing test.txt
Converted test.txt

C:\temp>python %X%\PyTools\dumpfile.py test.txt
----------------------------------------
'a\r\nb\r\nc\r\n'
----------------------------------------
0141    015     012     0142
015     012     0143    015
012
----------------------------------------
'a\r\n'
'b\r\n'
'c\r\n'

C:\temp>python %X%\PyTools\fixeoln_one.py todos test.txt    # makes no changes
Converted test.txt
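The octal display used in these traces can be sketched for Python 3 as well. This is a minimal sketch under stated assumptions, not the dumpfile.py script itself: the function name dump_octal and its width parameter are hypothetical, and the digits are formatted with '%03o', so 'a' shows as 141 rather than the 0141 that Python 2's oct produces.

```python
def dump_octal(data, width=4):
    """Return display rows of octal byte values, `width` bytes per row."""
    rows = []
    for i in range(0, len(data), width):
        chunk = data[i:i + width]            # iterating bytes yields ints in 3.x
        rows.append('\t'.join('%03o' % b for b in chunk))
    return rows

# Show the DOS-format test file contents, four bytes per row
for row in dump_octal(b'a\r\nb\r\nc\r\n'):
    print(row)
```

In the rows this prints, 015 is the carriage return and 012 is the line feed, matching the codes in the traces above.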

Nonintrusive conversions

Notice that no “Changing” message is emitted for the last command just run because no changes were actually made to the file (it was already in DOS format). Because this program is smart enough to avoid converting a line that is already in the target format, it is safe to rerun on a file even if you can’t recall what format the file already uses. More naïve conversion logic might be simpler, but it may not be repeatable. For instance, a replace string method call can be used to expand a Unix \n to a DOS \r\n, but only once:

>>> lines = 'aaa\nbbb\nccc\n'
>>> lines = lines.replace('\n', '\r\n')         # OK: \r added
>>> lines
'aaa\r\nbbb\r\nccc\r\n'
>>> lines = lines.replace('\n', '\r\n')         # bad: double \r
>>> lines
'aaa\r\r\nbbb\r\r\nccc\r\r\n'

Such logic could easily trash a file if applied to it twice.[*] To really understand how the script gets around this problem, though, we need to take a closer look at its use of slices and binary file modes.
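As an aside, a replace-based conversion can be made repeatable by collapsing any existing DOS markers before expanding. The following is a minimal Python 3 sketch of that alternative, not the approach the fixeoln scripts take; the function names to_dos and to_unix are hypothetical, and the data is assumed to fit in memory as a single bytes object.

```python
def to_dos(data):
    """Expand line ends to \\r\\n without ever doubling the \\r."""
    # Collapse existing \r\n to \n first, then expand every \n exactly once
    return data.replace(b'\r\n', b'\n').replace(b'\n', b'\r\n')

def to_unix(data):
    """Collapse \\r\\n to \\n; repeatable because no \\r\\n remains afterward."""
    return data.replace(b'\r\n', b'\n')

once = to_dos(b'aaa\nbbb\nccc\n')
twice = to_dos(once)                 # a second pass changes nothing
print(once == twice)
```

Unlike the line-by-line script, this always rebuilds the whole content, so a caller would need to compare the result with the input to know whether the file really needs rewriting.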

Slicing strings out of bounds

This script relies on subtle aspects of string slicing behavior to inspect parts of each line without size checks. For instance:

  • The expression line[-2:] returns the last two characters at the end of the line (or one or zero characters, if the line isn’t at least two characters long).

  • A slice such as line[-2:-1] returns the second-to-last character (or an empty string if the line is too small to have a second-to-last character).

  • The operation line[:-2] returns all characters except the last two at the end (or an empty string if there are fewer than three characters).

Because out-of-bounds slices scale slice limits to be inbounds, the script doesn’t need to add explicit tests to guarantee that the line is big enough to have end-line characters at the end. For example:

>>> 'aaaXY'[-2:], 'XY'[-2:], 'Y'[-2:], ''[-2:]
('XY', 'XY', 'Y', '')

>>> 'aaaXY'[-2:-1], 'XY'[-2:-1], 'Y'[-2:-1], ''[-2:-1]
('X', 'X', '', '')

>>> 'aaaXY'[:-2], 'aaaY'[:-1], 'XY'[:-2], 'Y'[:-1]
('aaa', 'aaa', '', '')

If you imagine characters such as \r and \n rather than the X and Y here, you’ll understand how the script exploits slice scaling to good effect.
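To make that concrete, here is a small sketch of the two tests the converter performs, written against Python 3 bytes. The helper names are hypothetical and introduced only for illustration. Slices matter even more in Python 3: indexing a bytes object returns an integer, whereas a slice returns bytes that can be compared directly with b'\r\n'.

```python
def has_dos_end(line):
    """True if the line ends with the two-byte DOS marker."""
    return line[-2:] == b'\r\n'          # short lines simply give a shorter slice

def has_bare_unix_end(line):
    """True if the line ends with \\n that is not preceded by \\r."""
    return line[-1:] == b'\n' and line[-2:-1] != b'\r'

# None of these raise IndexError, even for empty or one-byte lines
for sample in (b'', b'\n', b'\r\n', b'abc', b'abc\n', b'abc\r\n'):
    print(repr(sample), has_dos_end(sample), has_bare_unix_end(sample))
```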

Binary file mode revisited

Because this script aims to be portable to Windows, it also takes care to open files in binary mode, even though they contain text data. As we’ve seen, when files are opened in text mode on Windows, \r is stripped from \r\n markers on input, and \r is added before \n markers on output. This automatic conversion allows scripts to represent the end-of-line marker as \n on all platforms. Here, though, it would also mean that the script would never see the \r it’s looking for to detect a DOS-encoded line because the \r would be dropped before it ever reached the script:

>>> open('temp.txt', 'w').writelines(['aaa\n', 'bbb\n'])
>>> open('temp.txt', 'rb').read( )
'aaa\r\nbbb\r\n'
>>> open('temp.txt', 'r').read( )
'aaa\nbbb\n'

Without binary open mode, this can lead to fairly subtle and incorrect behavior on Windows. For example, if files are opened in text mode, converting in todos mode on Windows would actually produce double \r characters: the script might convert the stripped \n to \r\n, which is then expanded on output to \r\r\n!

>>> open('temp.txt', 'w').writelines(['aaa\r\n', 'bbb\r\n'])
>>> open('temp.txt', 'rb').read( )
'aaa\r\r\nbbb\r\r\n'

With binary mode, the script inputs a full \r\n, so no conversion is performed. Binary mode is also required for output on Windows in order to suppress the insertion of \r characters; without it, the tounix conversion would fail on that platform.[*]

If all that is too subtle to bear, just remember to use the b in file open mode strings if your scripts might be run on Windows and you mean to process either true binary data or text data as it is actually stored in the file.
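The same distinction carries over to Python 3, with some differences worth a hedged sketch. Binary mode there returns bytes exactly as stored, and text mode translates line ends on input on every platform by default, unless translation is switched off with the newline='' argument to open. The scratch file name below is hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'temp.txt')      # hypothetical scratch file

    with open(path, 'wb') as f:                  # binary: written as given
        f.write(b'aaa\r\nbbb\r\n')

    with open(path, 'rb') as f:                  # binary: read as stored
        raw = f.read()

    with open(path, 'r', newline='') as f:       # text, translation disabled
        untranslated = f.read()

    with open(path, 'r') as f:                   # text: \r\n mapped to \n
        translated = f.read()

print(repr(raw), repr(untranslated), repr(translated))
```

Only the first two reads preserve the \r that a converter needs to see; the default text-mode read hides it, just as Windows text mode does in the interaction above.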

Converting Line Ends in One Directory

Armed with a fully debugged single file converter, it’s an easy step to add support for converting all files in a single directory. Simply call the single file converter on every filename returned by a directory listing tool. The script in Example 7-3 uses the glob module we met in Chapter 4 to grab a list of files to convert.

Example 7-3. PP3E\PyTools\fixeoln_dir.py

##########################################################################
# Use: "python fixeoln_dir.py [tounix|todos] patterns?".
# convert end-lines in all the text files in the current directory
# (only: does not recurse to subdirectories). Reuses converter in the
# single-file version, fixeoln_one.
##########################################################################

import sys, glob
from fixeoln_one import convertEndlines
listonly = 0
patts = ['*.py', '*.pyw', '*.txt', '*.cgi', '*.html',    # text filenames
         '*.c',  '*.cxx', '*.h',   '*.i',   '*.out',     # in this package
         'README*', 'makefile*', 'output*', '*.note']

if __name__ == '__main__':
    errmsg = 'Required first argument missing: "todos" or "tounix"'
    assert (len(sys.argv) >= 2 and sys.argv[1] in ['todos', 'tounix']), errmsg

    if len(sys.argv) > 2:                 # glob anyhow: '*' not applied on DOS
        patts = sys.argv[2:]              # though not really needed on Linux
    filelists = map(glob.glob, patts)     # name matches in this dir only

    count = 0
    for list in filelists:
        for fname in list:
            if listonly:
                print count+1, '=>', fname
            else:
                convertEndlines(sys.argv[1], fname)
            count += 1

    print 'Visited %d files' % count

This module defines a list, patts, containing filename patterns that match all the kinds of text files that appear in the book examples tree; each pattern is passed to the built-in glob.glob call by map to be separately expanded into a list of matching files. That’s why there are nested for loops near the end. The outer loop steps through each glob result list, and the inner steps through each name within each list. Try the map call interactively if this doesn’t make sense:

>>> import glob
>>> map(glob.glob, ['*.py', '*.html'])
[['helloshell.py'], ['about-pp.html', 'about-pp2e.html', 'about-ppr2e.html']]
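In Python 3, map returns an iterator rather than a list of lists, so the same idea is often written by flattening the per-pattern results. The following is a minimal sketch under that assumption; the function name match_files and its dirname parameter are hypothetical, and unlike the script it removes duplicates when one file matches two patterns.

```python
import glob
import itertools
import os

def match_files(patterns, dirname='.'):
    """Return a sorted, flat list of plain files matching any pattern."""
    per_pattern = (glob.glob(os.path.join(dirname, p)) for p in patterns)
    names = itertools.chain.from_iterable(per_pattern)    # flatten the lists
    # Drop duplicates and anything that is not a regular file (e.g. directories)
    return sorted(n for n in set(names) if os.path.isfile(n))

print(match_files(['*.py', '*.txt']))
```

With a flat list, a single for loop over the result replaces the nested loops, at the cost of losing track of which pattern produced each name.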

This script requires a convert mode flag on the command line and assumes that it is run in the directory where files to be converted live; cd to the directory to be converted before running this script (or change it to accept a directory name argument too):

C:\temp\examples>python %X%\PyTools\fixeoln_dir.py tounix
Changing Launcher.py
Changing Launch_PyGadgets.py
Changing LaunchBrowser.py
...lines deleted...
Changing PyDemos.pyw
Changing PyGadgets_bar.pyw
Changing README-PP3E.txt
Visited 21 files

C:\temp\examples>python %X%\PyTools\fixeoln_dir.py todos
Changing Launcher.py
Changing Launch_PyGadgets.py
Changing LaunchBrowser.py
...lines deleted...
Changing PyDemos.pyw
Changing PyGadgets_bar.pyw
Changing README-PP3E.txt
Visited 21 files

C:\temp\examples>python %X%\PyTools\fixeoln_dir.py todos    # makes no changes
Visited 21 files

C:\temp\examples>fc PyDemos.pyw %X%\PyDemos.pyw
Comparing files PyDemos.pyw and C:\PP3rdEd\examples\PP3E\PyDemos.pyw
FC: no differences encountered

Notice that the third command generated no “Changing” messages again. Because the convertEndlines function of the single-file module is reused here to perform the actual updates, this script inherits that function’s repeatability: it’s OK to rerun this script on the same directory any number of times. Only lines that require conversion will be converted. This script also accepts an optional list of filename patterns on the command line in order to override the default patts list of files to be changed:

C:\temp\examples>python %X%\PyTools\fixeoln_dir.py tounix *.pyw *.csh
Changing echoEnvironment.pyw
Changing Launch_PyDemos.pyw
Changing Launch_PyGadgets_bar.pyw
Changing PyDemos.pyw
Changing PyGadgets_bar.pyw
Changing cleanall.csh
Changing makeall.csh
Changing package.csh
Changing setup-pp.csh
Changing setup-pp-embed.csh
Changing xferall.linux.csh
Visited 11 files

C:\temp\examples>python %X%\PyTools\fixeoln_dir.py tounix *.pyw *.csh
Visited 11 files

Also notice that the single-file script’s convertEndlines function performs an initial os.path.isfile test to make sure the passed-in filename represents a file, not a directory; when we start globbing with patterns to collect files to convert, it’s not impossible that a pattern’s expansion might include the name of a directory along with the desired files.

Tip

Unix and Linux users: Unix-like shells automatically glob (i.e., expand) filename pattern operators like * in command lines before they ever reach your script. You generally need to quote such patterns to pass them in to scripts verbatim (e.g., "*.py"). The fixeoln_dir script will still work if you don’t. Its glob.glob calls will simply find a single matching filename for each already globbed name, and so have no effect:

>>> glob.glob('PyDemos.pyw')
['PyDemos.pyw']

Patterns are not preglobbed in the DOS shell, though, so the glob.glob calls here are still a good idea in scripts that aspire to be as portable as this one.

Converting Line Ends in an Entire Tree

Finally, Example 7-4 applies what we’ve already learned to an entire directory tree. It simply applies the file-converter function to every filename produced by tree-walking logic. In fact, this script really just orchestrates calls to the original and already debugged convertEndlines function.

Example 7-4. PP3E\PyTools\fixeoln_all.py

##############################################################################
# Use: "python fixeoln_all.py [tounix|todos] patterns?".
# find and convert end-of-lines in all text files at and below the directory
# where this script is run (the dir you are in when you type the command).
# If needed, tries to use the Python find.py library module, else reads the
# output of a Unix-style find command; uses a default filename patterns list
# if patterns argument is absent. This script only changes files that need
# to be changed, so it's safe to run brute force from a root-level dir.
##############################################################################

import os, sys
debug    = False
pyfind   = False      # force py find
listonly = False      # True=show find results only

def findFiles(patts, debug=debug, pyfind=pyfind):
    try:
        if sys.platform[:3] == 'win' or pyfind:
            print 'Using Python find'
            try:
                import find                        # use python-code find.py
            except ImportError:                    # use mine if deprecated!
                from PP3E.PyTools import find      # may get from my dir anyhow
            matches = map(find.find, patts)        # startdir default = '.'
        else:
            print 'Using find executable'
            matches = []
            for patt in patts:
                findcmd = 'find . -name "%s" -print' % patt    # run find command
                lines = os.popen(findcmd).readlines( )        # remove endlines
                matches.append(map(str.strip, lines))          # lambda x: x[:-1]
    except:
        assert 0, 'Sorry - cannot find files'
    if debug: print matches
    return matches

if __name__ == '__main__':
    from fixeoln_dir import patts
    from fixeoln_one import convertEndlines

    errmsg = 'Required first argument missing: "todos" or "tounix"'
    assert (len(sys.argv) >= 2 and sys.argv[1] in ['todos', 'tounix']), errmsg

    if len(sys.argv) > 2:                  # quote in Unix shell
        patts = sys.argv[2:]               # else tries to expand
    matches = findFiles(patts)

    count = 0
    for matchlist in matches:                 # a list of lists
        for fname in matchlist:               # one per pattern
            if listonly:
                print count+1, '=>', fname
            else:
                convertEndlines(sys.argv[1], fname)
            count += 1
    print 'Visited %d files' % count

On Windows, the script uses the portable find.find tool we built in Chapter 4 (the hand-rolled equivalent of Python’s original find module)[*] to generate a list of all matching file and directory names in the tree; on other platforms, it resorts to spawning a less portable and perhaps slower find shell command just for illustration purposes.
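Where no find module is at hand, a portable equivalent can be sketched with the standard library's os.walk and fnmatch. This is an illustrative stand-in written for Python 3, not the find.py module from Chapter 4; the function name find_files is hypothetical, and like the tool described above it matches directory names as well as file names.

```python
import fnmatch
import os

def find_files(pattern, startdir='.'):
    """Return sorted paths at and below startdir whose names match pattern."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(startdir):
        for name in dirnames + filenames:          # directories match too
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)

print(len(find_files('*.py')))
```

Because directory names can match, a caller still needs a test such as os.path.isfile before treating each result as a text file.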

Once the file pathname lists are compiled, this script simply converts each found file in turn using the single-file converter module’s tools. Here is the collection of scripts at work converting the book examples tree on Windows; notice that this script also processes the current working directory (CWD; cd to the directory to be converted before typing the command line), and that Python treats forward and backward slashes the same way in the program filename:

C:\temp\examples>python %X%/PyTools/fixeoln_all.py tounix
Using Python find
Changing .\LaunchBrowser.py
Changing .\Launch_PyGadgets.py
Changing .\Launcher.py
Changing .\Other\cgimail.py
...lots of lines deleted...
Changing .\EmbExt\Exports\ClassAndMod\output.prog1
Changing .\EmbExt\Exports\output.prog1
Changing .\EmbExt\Regist\output
Visited 1051 files

C:\temp\examples>python %X%/PyTools/fixeoln_all.py todos
Using Python find
Changing .\LaunchBrowser.py
Changing .\Launch_PyGadgets.py
Changing .\Launcher.py
Changing .\Other\cgimail.py
...lots of lines deleted...
Changing .\EmbExt\Exports\ClassAndMod\output.prog1
Changing .\EmbExt\Exports\output.prog1
Changing .\EmbExt\Regist\output
Visited 1051 files

C:\temp\examples>python %X%/PyTools/fixeoln_all.py todos
Using Python find
Not a text file .\Embed\Inventory\Output
Not a text file .\Embed\Inventory\WithDbase\Output
Visited 1051 files

The view from the top

This script and its ancestors are shipped in the book’s example distribution as that portable converter tool I was looking for. To convert all example files in the tree to Unix line-terminator format, simply copy the entire PP3E examples tree to some “examples” directory on your hard drive and type these two commands in a shell:

cd examples/PP3E
python PyTools/fixeoln_all.py tounix

Of course, this assumes Python is already installed (see the example distribution’s README file for details) but will work on almost every platform in use today. To convert back to DOS, just replace tounix with todos and rerun. I ship this tool with a training CD for Python classes I teach too; to convert those files, we simply type:

cd HtmlExamples
python ..\..\Tools\fixeoln_all.py tounix

Once you get accustomed to the command lines, you can use this in all sorts of contexts. Finally, to make the conversion easier for beginners to run, the top-level examples directory includes tounix.py and todos.py scripts that can be simply double-clicked in a file explorer GUI; Example 7-5 shows the tounix converter.

Example 7-5. PP3E\tounix.py

#!/usr/local/bin/python
######################################################################
# Run me to convert all text files to Unix/Linux line-feed format.
# You only need to do this if you see odd '\r' characters at the end
# of lines in text files in this distribution, when they are viewed
# with your text editor (e.g., vi).  This script converts all files
# at and below the examples root, and only converts files that have
# not already been converted (it's OK to run this multiple times).
#
# Since this is a Python script which runs another Python script,
# you must install Python first to run this program; then from your
# system command line (e.g., a xterm window), cd to the directory
# where this script lives, and then type "python tounix.py".  You
# may also be able to simply click on this file's icon in your file
# system explorer, if it knows what '.py' files are.
######################################################################

import os
prompt = """
This program converts all text files in the book
examples distribution to UNIX line-feed format.
Are you sure you want to do this (y=yes)? """

answer = raw_input(prompt)
if answer not in ['y', 'Y', 'yes']:
    print 'Cancelled'
else:
    os.system('python PyTools/fixeoln_all.py tounix')

This script addresses the end user’s perception of usability, but other factors impact programmer usability—just as important to systems that will be read or changed by others. For example, the file, directory, and tree converters are coded in separate script files, but there is no law against combining them into a single program that relies on a command-line arguments pattern to know which of the three modes to run. The first argument could be a mode flag, tested by such a program:

if   mode == '-one':
    ...
elif mode == '-dir':
    ...
elif mode == '-all':
    ...

That seems more confusing than separate files per mode, though; it’s usually much easier to botch a complex command line than to type a specific program file’s name. It will also make for a confusing mix of global names and one very big piece of code at the bottom of the file. As always, simpler is usually better.



[*] In fact, see the files old_todos.py, old_tounix.py, and old_toboth.py in the PyTools directory in the examples distribution for a complete earlier implementation built around replace. It was repeatable for to-Unix changes, but not for to-DOS conversion (only the latter may add \r characters). The fixeoln scripts here were developed as a replacement, after I got burned by running to-DOS conversions twice.

[*] But wait, it gets worse. Because of the auto-deletion and insertion of \r characters in Windows text mode, we might simply read and write files in text mode to perform the todos line conversion when run on Windows; the file interface will automatically add the \r on output if it’s missing. However, this fails for other usage modes: tounix conversions on Windows (only binary writes can omit the \r), and todos when running on Unix (no \r is inserted). Magic is not always our friend.

[*] Recall that the home directory of a running script is always added to the front of sys.path to give the script import visibility to other files in the script’s directory. Because of that, this script would normally load the PP3E\PyTools\find.py module anyhow by just saying import find; it need not specify the full package path in the import. The try handler and full path import are useful here only if this script is moved to a different source directory. Since I move files a lot, I tend to code with self-inflicted worst-case scenarios in mind.
