Chapter 8. Files and Directories

In this chapter, you'll get to know some of the types and functions that Python provides for writing and reading files and accessing the contents of directories. These functions are important, because almost all nontrivial programs use files to read input or store output.

Python provides a rich collection of input/output functions; this chapter covers those that are most widely used. First, you'll use file objects, the most basic implementation of input/output in Python. Then you'll learn about functions for manipulating paths, retrieving information about files, and accessing directory contents.

In this chapter you learn:

  • How to write, append to, and read text files using file objects, the most basic implementation of input/output in Python.

  • How to handle the exceptions that file operations can raise.

  • How to use functions for manipulating paths, retrieving information about files, and accessing directory contents.

File Objects

In this chapter, most of the examples use Windows path names. If you are working on a different platform, replace the example paths with paths appropriate for your system.

If you do use Windows, however, remember that a backslash is a special character in a Python string, so you must escape (that is, double up) any backslash in a path. For instance, the path C:\Windows\Temp is represented by the Python string "C:\\Windows\\Temp". If you prefer, you can instead disable special treatment of backslashes in a string by placing an r before the opening quotes, so this same path may be written r"C:\Windows\Temp".
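The following short sketch compares the two spellings described above and shows what can go wrong when a backslash is left unescaped. The variable names are chosen only for illustration.

```python
# Three ways to spell a Windows path in Python source code.
escaped = "C:\\Windows\\Temp"   # every backslash doubled
raw = r"C:\Windows\Temp"        # raw string: backslashes are taken literally
broken = "C:\temp"              # "\t" is read as a tab character, not as \ and t

print(escaped)           # C:\Windows\Temp
print(escaped == raw)    # True
print(len(broken))       # 6, because "\t" collapsed into one character
```

The last line is the reason the escaping rule matters: an unescaped path can silently turn into a different string instead of raising an error.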

You'll use a string object to hold the path name for a sample file you create and access. If you're using Windows, enter the following (you can choose another path if you want):

>>> path = "C:\\sample.txt"

If you're using Linux, enter the following (or choose a path of your own):

>>> path = "/tmp/sample.txt"

Writing Text Files

Start by creating a file with some simple text. To create a new file on your system, create a file object, and tell Python you want to write to it. A file object represents a connection to a file, not the file itself, but if you open a file for writing that doesn't exist, Python creates the file automatically. Enter the following:

>>> def make_text_file():
...     a = open('test.txt', "w")
...     a.write("This is how you create a new text file")
...     a.close()
...
>>> make_text_file()

You start off by creating a new function called make_text_file(). You then tell Python to open a file named test.txt. Because Python does not find this file, it creates it for you. (Note: if the file did exist, Python would have deleted it and created a new one, so be careful when using this technique! In a moment, you learn to check to see if a file exists prior to creating one.) The "w" argument tells Python that you intend to write to the file; without it, Python would assume you intend to read from the file and would raise an exception when it found that the file didn't exist. Next, you add a line of text to the file, namely: "This is how you create a new text file".

Take a moment to navigate to the directory from which you started Python; on Windows this is often the Python installation directory, located somewhere such as C:\Python31. After you call make_text_file(), you will notice that a new file named test.txt has been created there. If you open it in a text editor, you will see the text you added in the preceding example. Congratulations, you have created your first file!

Now that you have created a file with the preceding technique, create a program that first checks to see if the file name exists; if so, it will give you an error message; if not, it will create the file. Type in the following code:

>>> import os
>>> def make_another_file():
...     if os.path.isfile('test.txt'):
...         print("You are trying to create a file that already exists!")
...     else:
...         f = open('test.txt', "w")
...         f.write("This is how you create a new text file")
...         f.close()
...
>>> make_another_file()
You are trying to create a file that already exists!

When opening a file, and with all the other file-manipulation functions discussed in this chapter, you can specify either a relative path (a path relative to the current directory, the directory in which your program or Python was run) or an absolute path (a path starting at the root of the drive or file system). For example, /tmp/sample.txt is an absolute path, whereas just sample.txt, without the specification of what directory is above it, is a relative path.
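As a small illustration of the difference, the sketch below converts a relative path to an absolute one and asks Python which is which. It relies on os.path.isabs and os.path.abspath from the standard library; the file name sample.txt is just a placeholder and does not need to exist.

```python
import os

relative = "sample.txt"
absolute = os.path.abspath(relative)   # prepends the current directory

print(os.path.isabs(relative))   # False
print(os.path.isabs(absolute))   # True
print(absolute)                  # output depends on your current directory
```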

Appending Text to a File

Appending text to a file is pretty simple to do. Instead of opening the file in write mode ("w"), you open it in append mode ("a"). By doing so, you ensure that the data in the existing file is not overwritten; instead, any new text is added to the end of the file.
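Here is a minimal sketch of appending in practice. The helper name append_text is hypothetical, and the sketch works on a file in a temporary directory (an assumption made so that it does not touch any of your own files); you could just as well pass 'test.txt'.

```python
import os
import tempfile

def append_text(path, text):
    # Mode "a" opens the file for appending; the file is created if missing.
    f = open(path, "a")
    f.write(text)
    f.close()

# Work in a throwaway directory so no real file is touched.
demo_path = os.path.join(tempfile.mkdtemp(), "test.txt")

# Create the file with mode "w" first.
f = open(demo_path, "w")
f.write("This is how you create a new text file\n")
f.close()

# Appending leaves the first line in place and adds a second one.
append_text(demo_path, "Here is some additional text\n")

f = open(demo_path, "r")
contents = f.read()
f.close()
print(contents)
```

If you replaced "a" with "w" in append_text, the second call would erase the first line instead of keeping it.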

Reading Text Files

Reading from a file is similar. First, open the file by creating a file object. This time, use "r" to tell Python you intend to read from the file. It's the default, so you can omit the second argument altogether if you want.

>>> a=open("test.txt","r")

Make sure you use the path to the file you created earlier, or use the path to some other file you want to read. If the file doesn't exist, Python will raise an exception.

You can read a line from the file using the readline method. The first time you call this method on a file object, it will return the first line of text in the file:

>>> a.readline()
'This is how you create a new text file\n'

Notice that readline includes the newline character at the end of the string it returns. To read the contents of the file one line at a time, call readline repeatedly.
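Calling readline repeatedly can be sketched as a loop. The sketch relies on the fact that readline returns an empty string once the end of the file is reached; the sample file and its three lines are made up for the demonstration.

```python
import os
import tempfile

# Create a small sample file in a throwaway directory.
demo_path = os.path.join(tempfile.mkdtemp(), "test.txt")
f = open(demo_path, "w")
f.write("first line\nsecond line\nthird line\n")
f.close()

lines = []
a = open(demo_path, "r")
while True:
    line = a.readline()
    if line == "":
        # An empty string (no newline at all) means end of file.
        break
    lines.append(line)
a.close()

print(lines)   # ['first line\n', 'second line\n', 'third line\n']
```

A blank line inside the file is returned as "\n", not as an empty string, so the loop does not stop early.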

You can also read the rest of the file all at once with the read method. This method returns any text in the file that you haven't read yet. (If you call read as soon as you open a file, it will return the entire contents of the file, as one long string.)

>>> a=open("test.txt","r")
>>> text=a.read()
>>> print(text)
This is how you create a new text file
Here is some additional text
here is
more
text

Because you've used print to print the text, Python shows newline characters as actual line breaks, instead of as \n.

When you're done reading the file, close it by calling the close method of the file object. After that, you can delete the name that referred to it:

>>> a.close()
>>> del a

It's convenient to have Python break a text file into lines, but it's nice to be able to get all the lines at one time — for instance, to use in a loop. The readlines method does exactly that: It returns the remaining lines in the file as a list of strings. Suppose, for instance, that you want to print out the length of each line in a file. This function will do that:

def print_line_lengths():
    a = open("test.txt", "r")
    text = a.readlines()
    # Close the file as soon as all the lines have been read.
    a.close()
    for line in text:
        print(len(line))

File Exceptions

Because your Python program does not have exclusive control of the computer's file system, it must be prepared to handle unexpected errors when accessing files. When Python encounters a problem performing a file operation, it raises an IOError exception. (Exceptions are described in Chapter 4. In Python 3.3 and later, IOError is another name for OSError, so code that catches IOError still works.) The string representation of the exception will describe the problem.

Many circumstances exist in which you can get an IOError, including the following:

  • If you attempt to open a file for reading that does not exist

  • If you attempt to create a file in a directory that does not exist

  • If you attempt to open a file for which you do not have read access

  • If you attempt to create a file in a directory for which you do not have write access

  • If your computer encounters a disk error (or network error, if you are accessing a file on a network disk)

If you want your program to react gracefully when errors occur, you must handle these exceptions. What to do when you receive an exception depends on what your program does. In some cases, you may want to try a different file, perhaps after printing a warning message. In other cases, you may have to ask the user what to do next or simply exit if recovery is not possible.
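One way to react gracefully is sketched below: try to open the file, and fall back to a default value after printing a warning if that fails. The function name read_first_line and its default parameter are hypothetical, and the missing file lives in a temporary directory so the failure is guaranteed.

```python
import os
import tempfile

def read_first_line(path, default=""):
    """Return the first line of the file at path, or default if it cannot be opened."""
    try:
        f = open(path, "r")
    except IOError as error:
        # The string form of the exception describes the problem.
        print("Warning: could not open file:", error)
        return default
    try:
        return f.readline()
    finally:
        # Close the file whether or not reading succeeded.
        f.close()

missing_path = os.path.join(tempfile.mkdtemp(), "missing.txt")
result = read_first_line(missing_path, default="(no data)")
print(result)   # (no data)
```

Whether a default value, a retry with another file, or exiting is the right response depends on your program; this sketch shows only the first option.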

Paths and Directories

The file systems on Windows, Linux, UNIX, and Mac OS X have a lot in common but differ in some of their rules, conventions, and capabilities. For example, Windows uses a backslash to separate directory names in a path, whereas Linux and UNIX (and Mac OS X is a type of UNIX) use a forward slash. In addition, Windows uses drive letters, whereas the others don't. These differences can be a major irritation if you are writing a program that will run on different platforms. Python makes your life easier by hiding some of the annoying details of path and directory manipulation in the os module. Using os will not solve all of your portability problems, however; some functions in os are not available on all platforms. This section describes only those functions that are.

Even if you intend to use your programs only on a single platform and anticipate being able to avoid most of these issues, if your program is useful you never know if someone will try to run it on another platform someday. So it's better to tap the os module, because it provides many useful services. Don't forget to import os first so you can use it.

Exceptions in os

The functions in the os module raise OSError exceptions on failure. If you want your program to behave nicely when things go wrong, you must handle this exception. As with IOError, the string representation of the exception will provide a description of the problem.

Paths

The os module contains another module, os.path, which provides functions for manipulating paths. Because paths are strings, you could use ordinary string manipulation to assemble and disassemble file paths. Your code would not be as easily portable, however, and would probably not handle special cases that os.path knows about. Use os.path to manipulate paths, and your programs will be better for it.

To assemble directory names into a path, use os.path.join. Python uses the path separator appropriate for your operating system. Don't forget to import the os.path module before you use it. For example, on Windows, enter the following:

>>> import os.path
>>> os.path.join("snakes", "Python")
'snakes\\Python'

On Linux, however, using the same parameters to os.path.join gives you the following, different, result:

>>> import os.path
>>> os.path.join("snakes", "Python")
'snakes/Python'

You can specify more than two components as well.
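For instance, the following sketch joins four components in one call. The component names are made up, and the separator in the result depends on the platform you run it on.

```python
import os

path = os.path.join("snakes", "Python", "docs", "index.txt")
print(path)   # snakes\Python\docs\index.txt on Windows, snakes/Python/docs/index.txt on Linux

# os.sep holds the separator that os.path.join used.
parts = path.split(os.sep)
print(parts)  # ['snakes', 'Python', 'docs', 'index.txt']
```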

The inverse function is os.path.split, which splits off the last component of a path. It returns a tuple of two items: the path of the parent directory and the last path component. Here's an example:

>>> os.path.split("C:\\Program Files\\Python30\\Lib")
('C:\\Program Files\\Python30', 'Lib')

On UNIX or Linux, it would look like this:

>>> os.path.split("/usr/bin/python")
('/usr/bin', 'python')

Automatic unpacking of sequences comes in handy here. What happens is that when os.path.split returns a tuple, the tuple can be broken up into the elements on the left-hand side of the equals sign:

>>> parent_path, name = os.path.split("C:\\Program Files\\Python30\\Lib")
>>> print(parent_path)
C:\Program Files\Python30
>>> print(name)
Lib

Although os.path.split only splits off the last path component, sometimes you might want to split a path completely into directory names. Writing a function to do this is not difficult; what you want to do is call os.path.split on the path, and then call os.path.split on the parent directory path, and so forth, until you get all the way to the root directory. An elegant way to do this is with a recursive function, which is a function that calls itself. It might look like this:

def split_fully(path):
    parent_path, name = os.path.split(path)
    if name == "":
        return (parent_path, )
    else:
        return split_fully(parent_path) + (name, )

The key line is the last line, where the function calls itself to split the parent path into components. The last component of the path, name, is then attached to the end of the fully split parent path. The lines in the middle of split_fully prevent the function from calling itself infinitely. When os.path.split can't split a path any further, it returns an empty string for the second component; split_fully notices this and returns the parent path without calling itself again.

A function can call itself safely because Python keeps track of the arguments and local variables in each running instance of the function, even if one is called from another. In this case, when split_fully calls itself, the outer (first) instance doesn't lose its value of name even though the inner (second) instance assigns a different value to it, because each has its own copy of the variable name. When the inner instance returns, the outer instance continues with the same variable values it had when it made the recursive call.

When you write a recursive function, make sure that it never calls itself infinitely, which would be bad because it would never return. (Actually, Python would run out of space in which to keep track of all the calls, and would raise an exception.) The function split_fully won't call itself infinitely, because eventually path is short enough that name is an empty string, and the function returns without calling itself again.

Notice in this function the two uses of single-element tuples, which must include a comma in the parentheses. Without the comma, Python would interpret the parentheses as ordinary grouping parentheses, as in a mathematical expression: (name, ) is a tuple with one element; (name) is the same as name.
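The difference the comma makes can be checked directly. This is a small sketch with made-up values; it also shows that adding a plain string to a tuple fails, which is why split_fully needs the one-element tuple.

```python
name = "Lib"

single = (name, )    # a tuple containing one string
grouped = (name)     # just the string; the parentheses only group

print(type(single).__name__)    # tuple
print(type(grouped).__name__)   # str

# Tuples can be concatenated with +, as split_fully does.
combined = ("C:\\", "Python31") + (name, )
print(combined)                 # ('C:\\', 'Python31', 'Lib')

# Without the comma, Python tries to add a string to a tuple and fails.
try:
    ("C:\\", "Python31") + (name)
    failed = False
except TypeError:
    failed = True
print(failed)                   # True
```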

Here's the function in action:

>>> split_fully("C:\\Program Files\\Python31\\Lib")
('C:\\', 'Program Files', 'Python31', 'Lib')

After you have the name of a file, you can split off its extension with os.path.splitext:

>>> os.path.splitext("image.jpg")
('image', '.jpg')

The call to splitext returns a two-element tuple, so you can extract just the extension as shown here:

>>> parts = os.path.splitext("image.jpg")
>>> extension = parts[1]

You don't actually need the variable parts at all. You can extract the second component, the extension, directly from the return value of splitext:

>>> extension = os.path.splitext("image.jpg")[1]

Also handy is os.path.normpath, which normalizes or "cleans up" a path:

>>> print(os.path.normpath(r"C:\\Program Files\Perl\..\Python30"))
C:\Program Files\Python30

Notice how the ".." was eliminated by backing up one directory component, and the double separator was fixed. Similar to this is os.path.abspath, which converts a relative path (a path relative to the current directory) to an absolute path (a path starting at the root of the drive or file system):

>>> print(os.path.abspath("other_stuff"))
C:\Program Files\Python30\other_stuff

Your output will depend on your current directory when you call abspath. As you may have noticed, this works even though you don't have an actual file or directory named other_stuff in your Python directory. None of the path manipulation functions in os.path check whether the path you are manipulating actually exists.
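If you are not on Windows, you can still try the Windows behavior of normpath. The sketch below assumes the standard library module ntpath, which is the Windows implementation of os.path and can be imported on any platform; this chapter does not otherwise use it. The paths are examples only and do not need to exist.

```python
import ntpath   # the Windows flavor of os.path, importable on any platform
import os

# A raw string with a doubled separator and a ".." component.
messy = r"C:\\Program Files\Perl\..\Python30"
clean = ntpath.normpath(messy)
print(clean)   # C:\Program Files\Python30

# abspath works on names that do not exist; it only manipulates the string.
made_up = os.path.abspath("other_stuff")
print(os.path.isabs(made_up))   # True
```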

If you want to know whether a path actually does exist, use os.path.exists. It simply returns True or False:

>>> os.path.exists("C:\\Windows")
True
>>> os.path.exists("C:\\Windows\\reptiles")
False

Of course, if you're not using Windows, or your Windows is installed in another directory (like C:\WinNT), both of these will return False!

Directory Contents

Now you know how to construct arbitrary paths and take them apart. But how can you find out what's actually on your disk? The os.listdir function tells you, by returning a list of the names of the entries in a directory: the files, subdirectories, and so on that it contains.
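A simple directory-listing function built on os.listdir might look like the following sketch. The name print_dir matches the function this chapter refers to later, but the body here is an assumption rather than the original listing. Sorting the names is also a choice of this sketch, because os.listdir returns entries in arbitrary order. The demonstration directory is a temporary one with made-up contents.

```python
import os
import tempfile

def print_dir(dir_path):
    # Print the full path of every entry in dir_path, in alphabetical order.
    for name in sorted(os.listdir(dir_path)):
        print(os.path.join(dir_path, name))

# Build a small throwaway directory to list.
demo_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_dir, "sub"))
for file_name in ("b.txt", "a.txt"):
    f = open(os.path.join(demo_dir, file_name), "w")
    f.close()

print_dir(demo_dir)
entries = sorted(os.listdir(demo_dir))
print(entries)   # ['a.txt', 'b.txt', 'sub']
```

Notice that os.listdir returns bare names, not paths, and does not say which entries are files and which are directories.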

Obtaining Information about Files

You can easily determine whether a path refers to a file or to a directory. If it's a file, os.path.isfile will return True; if it's a directory, os.path.isdir will return True. Both return False if the path does not exist at all:

>>> os.path.isfile("C:\\Windows")
False
>>> os.path.isdir("C:\\Windows")
True

Recursive Directory Listings

You can combine os.path.isdir with os.listdir to do something very useful: process subdirectories recursively. For instance, you can list the contents of a directory, its subdirectories, their subdirectories, and so on. To do this, it's again useful to write a recursive function. This time, when the function finds a subdirectory, it calls itself to list the contents of that subdirectory:

def print_tree(dir_path):
    for name in os.listdir(dir_path):
        full_path = os.path.join(dir_path, name)
        print(full_path)
        if os.path.isdir(full_path):
            print_tree(full_path)

You'll notice the similarity to the function print_dir you wrote previously. This function, however, constructs the full path to each entry as full_path, because it's needed both for printing out and for consideration as a subdirectory. The last two lines check whether it is a subdirectory, and if so, the function calls itself to list the subdirectory's contents before continuing. If you try this function, make sure that you don't call it for a large directory tree; otherwise, you'll have to wait a while as it prints out the full path of every single subdirectory and file in the tree.

Other functions in os.path provide information about a file. For instance, os.path.getsize returns the size, in bytes, of a file without having to open and scan it. Use os.path.getmtime to obtain the time when the file was last modified. The return value is the number of seconds between the start of the year 1970 and when the file was last modified — not a format users prefer for dates! You have to call another function, time.ctime, to convert the result to an easily understood format (don't forget to import the time module first). Here's an example that outputs when your Python installation directory was last modified, which is probably the date and time you installed Python on your computer:

>>> import time
>>> mod_time = os.path.getmtime("C:\\Python30")
>>> print(time.ctime(mod_time))
Thu Mar 15 01:36:26 2009

Now you know how to modify print_dir to print the contents of a directory, including the size and modification time of each file. In the interest of brevity, the version that follows prints only the names of entries, not their full paths:

def print_dir_info(dir_path):
    for name in os.listdir(dir_path):
        full_path = os.path.join(dir_path, name)
        file_size = os.path.getsize(full_path)
        mod_time = time.ctime(os.path.getmtime(full_path))
        print("%-32s: %8d bytes, modified %s" % (name, file_size, mod_time))

The last statement uses Python's built-in string formatting that you saw in Chapters 1 and 2 to produce neatly aligned output. If there's other file information you would like to print, browse the documentation for the os.path module to learn how to obtain it.

Renaming, Moving, Copying, and Removing Files

The shutil module contains functions for operating on files. You can use the function shutil.move to rename a file:

>>> import shutil
>>> shutil.move("server.log", "server.log.backup")

Alternatively, you can use it to move a file to another directory:

>>> shutil.move("old mail.txt", "C:\\data\\archive\\")

You might have noticed that os also contains a function for renaming or moving files, os.rename. You should generally use shutil.move instead, because with os.rename, you may not specify a directory name as the destination and on some systems os.rename cannot move a file to another disk or file system.

The shutil module also provides the copy function to copy a file to a new name or directory. You can simply use the following:

>>> shutil.copy("important.dat", "C:\\backups")

Deleting a file is easiest of all. Just call os.remove:

>>> os.remove("junk.dat")

If you're an old-school UNIX hacker (or want to pass yourself off as one), you may prefer os.unlink, which does the same thing.
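To try these functions without risking your own files, you can work in a temporary directory, as in this sketch. The directory layout and file names are made up for the demonstration.

```python
import os
import shutil
import tempfile

work_dir = tempfile.mkdtemp()
archive_dir = os.path.join(work_dir, "archive")
os.mkdir(archive_dir)

# Create a small file to operate on.
log_path = os.path.join(work_dir, "server.log")
f = open(log_path, "w")
f.write("started\n")
f.close()

# Rename the file within the same directory.
backup_path = log_path + ".backup"
shutil.move(log_path, backup_path)

# Copy it into another directory; the copy keeps the file's name.
shutil.copy(backup_path, archive_dir)

# Delete the original, leaving only the copy.
os.remove(backup_path)

print(os.listdir(work_dir))      # ['archive']
print(os.listdir(archive_dir))   # ['server.log.backup']
```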

Example: Rotating Files

In this example, you tackle a more difficult real-world file management task. Suppose that you need to keep old versions of a file around. For instance, system administrators will keep old versions of system log files. Often, older versions of a file are named with a numerical suffix — for instance, web.log.1, web.log.2, and so on — in which a larger number indicates an older version. To make room for a new version of the file, the old versions are rotated: The current version of web.log becomes version web.log.1, web.log.1 becomes web.log.2, and so on.

This is clearly tedious to do by hand, but Python can make quick work of it. You have a few tricky points to consider, however. First, the current version of the file is named differently than old versions; whereas old versions have a numerical suffix, the current version does not. One way to get around this is to treat the current version as version zero. A short function, make_version_path, constructs the right path for both current and old versions.

The other subtle point is that you must make sure to rename the oldest version first. For instance, if you rename web.log.1 to web.log.2 before renaming web.log.2, the latter will be overwritten and its contents lost before you get to it, which isn't what you want. Once again, a recursive function will save you. The function can call itself to rotate the next-older version of the log file before it gets overwritten:

import os
import shutil

def make_version_path(path, version):
    if version == 0:
        # No suffix for version 0, the current version.
        return path
    else:
        # Append a suffix to indicate the older version.
        return path + "." + str(version)

def rotate(path, version=0):
    # Construct the name of the version we're rotating.
    old_path = make_version_path(path, version)
    if not os.path.exists(old_path):
        # It doesn't exist, so complain.
        raise IOError("'%s' doesn't exist" % path)
    # Construct the new version name for this file.
    new_path = make_version_path(path, version + 1)
    # Is there already a version with this name?
    if os.path.exists(new_path):
        # Yes.  Rotate it out of the way first!
        rotate(path, version + 1)
    # Now we can rename the version safely.
    shutil.move(old_path, new_path)

Take a few minutes to study this code and the comments. The rotate function uses a technique common in recursive functions: a second argument for handling recursive cases, which here is the version number of the file being rotated. The argument has a default value, zero, which indicates the current version of the file. When you call the function (as opposed to when the function is calling itself) you don't specify a value for this argument. For example, you can just call rotate("web.log").

You may have noticed that the function checks to make sure that the file being rotated actually exists and raises an exception if it doesn't. But suppose you want to rotate a system log file that may or may not exist. One way to handle this is to create an empty log file whenever it's missing. Remember that when you open a file that doesn't exist for writing, Python creates the file automatically. If you don't actually write anything to the new file, it will be empty. Here's a function that rotates a log file that may or may not exist, creating it first if it doesn't. It uses the rotate function you wrote previously.

def rotate_log_file(path):
    if not os.path.exists(path):
        # The file is missing, so create it.
        new_file = open(path, "w")
        # Close the new file immediately, which leaves it empty.
        new_file.close()
    # Now rotate it.
    rotate(path)
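To watch the rotation happen, you can run it twice on a log file in a temporary directory. The sketch below repeats compact versions of make_version_path and rotate so that it runs on its own; the helpers write_text and read_text are hypothetical names introduced only for the demonstration.

```python
import os
import shutil
import tempfile

def make_version_path(path, version):
    # Version 0 is the current file and has no suffix.
    return path if version == 0 else path + "." + str(version)

def rotate(path, version=0):
    old_path = make_version_path(path, version)
    if not os.path.exists(old_path):
        raise IOError("'%s' doesn't exist" % path)
    new_path = make_version_path(path, version + 1)
    if os.path.exists(new_path):
        # Move the older version out of the way first.
        rotate(path, version + 1)
    shutil.move(old_path, new_path)

def write_text(path, text):
    f = open(path, "w")
    f.write(text)
    f.close()

def read_text(path):
    f = open(path, "r")
    text = f.read()
    f.close()
    return text

log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "web.log")

write_text(log_path, "day 1")
rotate(log_path)               # web.log -> web.log.1
write_text(log_path, "day 2")
rotate(log_path)               # web.log.1 -> web.log.2, then web.log -> web.log.1

print(sorted(os.listdir(log_dir)))    # ['web.log.1', 'web.log.2']
print(read_text(log_path + ".1"))     # day 2
print(read_text(log_path + ".2"))     # day 1
```

The larger suffix ends up holding the older contents, which is the ordering described at the start of this example.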

Creating and Removing Directories

Creating an empty directory is even easier than creating a file. Just call os.mkdir. The parent directory must exist, however. The following will raise an exception if the parent directory C:\photos\zoo does not exist:

>>> os.mkdir("C:\\photos\\zoo\\snakes")

You can create the parent directory itself using os.mkdir, but the easy way out is instead to use os.makedirs, which creates missing parent directories. For example, the following will create C:\photos and C:\photos\zoo, if necessary:

>>> os.makedirs("C:\\photos\\zoo\\snakes")

Remove a directory with os.rmdir. This works only for empty directories; if the directory is not empty, you'll have to remove its contents first:

>>> os.rmdir("C:\\photos\\zoo\\snakes")

This removes only the snakes subdirectory.

There is a way to remove a directory even when it contains other files and subdirectories. The function shutil.rmtree does this. Be careful, however; if you make a programming or typing mistake and pass the wrong path to this function, you could delete a whole bunch of files before you even know what's going on! For instance, this will delete your entire photo collection — zoo, snakes, and all:

>>> shutil.rmtree("C:\\photos")
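The four directory functions can be tried safely inside a temporary directory, as in the sketch below. The photos, zoo, and snakes names mirror the examples above, but everything is created under a throwaway base directory, which is an assumption of this sketch.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()
photos = os.path.join(base, "photos")
zoo = os.path.join(photos, "zoo")
snakes = os.path.join(zoo, "snakes")

# os.mkdir fails because the parent directories do not exist yet.
try:
    os.mkdir(snakes)
    mkdir_failed = False
except OSError:
    mkdir_failed = True

# os.makedirs creates photos and zoo along the way.
os.makedirs(snakes)

# os.rmdir removes only the (empty) snakes directory.
os.rmdir(snakes)
snakes_left = os.path.exists(snakes)
zoo_left = os.path.isdir(zoo)

# shutil.rmtree removes photos and everything beneath it.
shutil.rmtree(photos)
photos_left = os.path.exists(photos)

print(mkdir_failed, snakes_left, zoo_left, photos_left)   # True False True False
```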

Globbing

If you have used the command prompt on Windows, or a shell command line on GNU/Linux, UNIX, or Mac OS X, you probably have encountered wildcard patterns before. These are the special characters, such as * and ?, which you use to match many files with similar names. For example, you may have used the pattern P* to match all files that start with P, or *.txt to match all files with the extension .txt.

Globbing is hackers' jargon for expanding wildcards in file name patterns. Python provides a function glob, in the module also named glob, which implements globbing of directory contents. The glob.glob function takes a glob pattern and returns a list of matching file names or paths, similar to os.listdir.

For example, try the following command to list entries in your C:\Program Files directory that start with M:

>>> import glob
>>> glob.glob("C:\\Program Files\\M*")
['C:\\Program Files\\Messenger', 'C:\\Program Files\\Microsoft Office',
'C:\\Program Files\\Mozilla Firefox']

Your computer's output will vary depending on what software you have installed. Observe that glob.glob returns paths containing drive letters and directory names if the pattern includes them, unlike os.listdir, which only returns the names in the specified directory.

The following table lists the wildcards you can use in glob patterns. These wildcards are not necessarily the same as those available in the command shell of your operating system, but Python's glob module uses the same syntax on all platforms. Note that the syntax for glob patterns resembles but is not the same as the syntax for regular expressions.

Wildcard   Matches                                        Example

*          Any zero or more characters                    *.m* matches names whose extensions begin with m.

?          Any one character                              ??? matches names exactly three characters long.

[...]      Any one character listed in the brackets       [AEIOU]* matches names that begin with capital vowels.

[!...]     Any one character not listed in the brackets   *[!s] matches names that don't end with an s.

You can also use a range of characters in square brackets. For example, [m-p] matches any one of the letters m, n, o, or p, and [!0-9] matches any character other than a digit.
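The wildcards above can be tried against a handful of made-up file names in a temporary directory, as sketched below. The helper matching_names is hypothetical. The sketch uses glob.escape (available in Python 3.4 and later) so that any special characters in the temporary directory's own path are not treated as wildcards, and it uses all-lowercase names because matching on Windows ignores case.

```python
import glob
import os
import tempfile

demo_dir = tempfile.mkdtemp()
for file_name in ("apple.txt", "image.mp3", "notes.md", "cats", "dog"):
    f = open(os.path.join(demo_dir, file_name), "w")
    f.close()

def matching_names(pattern):
    # Return the sorted bare names in demo_dir that match the pattern.
    paths = glob.glob(os.path.join(glob.escape(demo_dir), pattern))
    return sorted(os.path.basename(p) for p in paths)

print(matching_names("*.m*"))      # ['image.mp3', 'notes.md']
print(matching_names("???"))       # ['dog']
print(matching_names("[aeiou]*"))  # ['apple.txt', 'image.mp3']
print(matching_names("*[!s]"))     # ['apple.txt', 'dog', 'image.mp3', 'notes.md']
print(matching_names("[m-p]*"))    # ['notes.md']
```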

Globbing is a handy way of selecting a group of similar files for a file operation. For instance, deleting all backup files with the extension .bak in the directory C:\source is as easy as these two lines:

>>> for path in glob.glob("C:\\source\\*.bak"):
...     os.remove(path)

Globbing is considerably more powerful than os.listdir, because you can specify wildcards in directory and subdirectory names. For patterns like this, glob.glob can return paths in more than one directory. For instance, the following code returns all files with the extension .txt in subdirectories of the current directory:

>>> glob.glob("*\\*.txt")
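A pattern with a wildcard in the directory position can be written portably by building it with os.path.join instead of typing a backslash. The sketch below assumes a made-up directory tree under a temporary root, and uses glob.escape (Python 3.4 and later) on the root so that only the intended wildcards are expanded.

```python
import glob
import os
import tempfile

root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "a"))
os.mkdir(os.path.join(root, "b"))
for sub, file_name in (("a", "x.txt"), ("b", "y.txt"), ("b", "z.dat"), ("", "top.txt")):
    f = open(os.path.join(root, sub, file_name), "w")
    f.close()

# "*" in the directory position matches every subdirectory of root.
pattern = os.path.join(glob.escape(root), "*", "*.txt")
found = sorted(os.path.relpath(p, root) for p in glob.glob(pattern))
print(found)   # a/x.txt and b/y.txt (with backslashes on Windows)
```

The file top.txt is not returned, because the pattern requires exactly one directory level between the root and the file.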

Summary

In this chapter, you learned how to write data to and read data from files on your disk. Using a file object, you can now write strings to a file, and read back the contents of a file, line-by-line or all at once. You can use these techniques to read input into your program, to generate output files, or to store intermediate results.

You also learned about paths, which specify the location of a file on your disk, and how to manipulate them. Using os.listdir or glob, you can find out what's on your disk.

The key points to take away from this chapter are:

  • A file object represents a connection to a file, not the file itself, but if you open a file for writing that doesn't exist, Python creates the file automatically.

  • To append to a file, open it in append mode ("a") instead of write mode ("w"). This ensures that the data in the file is not overwritten.

  • To read from a file, use "r", as in the following: a=open("test.txt","r")

  • The readline method returns the next line of text in a file, including its trailing newline character; the first call returns the first line.

  • When you are finished reading a file, be sure to close it explicitly by calling the close method of the file object.

  • The os.path module, located in the os module, provides functions for manipulating paths.

  • The os.listdir function returns the names of the files, subdirectories, and other entries in a directory.

  • Globbing is hackers' jargon for expanding wildcards in filename patterns. Python provides a function glob, in the module also named glob, which implements globbing of directory contents. The glob.glob function takes a glob pattern and returns a list of matching filenames or paths, similar to os.listdir.

Exercises

  1. Create another version of the (nonrecursive) print_dir function that lists all subdirectory names first, followed by names of files in the directory. Names of subdirectories should be alphabetized, as should file names. (For extra credit, write your function in such a way that it calls os.listdir only one time. Python can manipulate strings faster than it can execute os.listdir.)

  2. Modify the rotate function to keep only a fixed number of old versions of the file. The number of versions should be specified in an additional parameter. Excess old versions above this number should be deleted.
