Chapter 6. Data

Introduction

The need to control dealing with data, files, and directories is one of the reasons IT organizations need sysadmins. What sysadmin hasn’t had the need to process all of the files in a directory tree and parse and replace text? And if you haven’t written a script yet that renames all of the files in a directory tree, you probably will at some point in the future. These abilities are the essence of what it means to be a sysadmin, or at least to be a really good sysadmin. For the rest of this chapter, we’re going to focus on data, files, and directories.

Sysadmins need to constantly wrangle data from one location to the next. The movement of data on a daily basis is more prevalent in some sysadmin jobs than others. In the animation industry, constantly “wrangling” data from one location to the next is required because digital film production requires terabytes upon terabytes of storage. There are also different disk I/O requirements based on the quality and resolution of the image being viewed at any given time. If data needs to be “wrangled” to an HD preview room so that it can be inspected during a digital daily, then the “fresh” uncompressed, or slightly compressed, HD image files will need to be moved. Files need to be moved because there are generally two types of storage in animation: cheap, large, slow, safe storage, and fast, expensive storage that is oftentimes a JBOD, or “just a bunch of disks,” striped together as RAID 0 for speed. A sysadmin in the film industry who primarily deals with data is called a “data wrangler.”

A data wrangler needs to be constantly moving and migrating fresh data from location to location. Often the workhorse of moving data is rsync, scp, cp, or mv. These simple and powerful tools can be scripted with Python to do some incredible things.

Using the standard library, it is possible to do some amazing things without shelling out once. The advantage of using the standard library is that your data moving script will work just about anywhere, without having to depend on a platform-specific version of, say, tar.

Let’s also not forget backups. There are many custom backup scripts and applications that can be written with a trivial amount of Python code. We will caution that writing extra tests for your backup code is not only wise, but necessary. You should make sure you have both unit and functional tests in place when you are depending on backup scripts you have written yourself.

In addition, it is often necessary to process data at some point before, after, or during a move. Of course, Python is great for this as well. A deduplication tool, one that finds duplicate files and performs actions upon them, can be very helpful here, so we’ll show you how to write one. This is one example of dealing with the constant flow of data that a sysadmin often encounters.

Using the OS Module to Interact with Data

If you have ever struggled with writing cross-platform shell scripts, you will appreciate the fact that the OS module is a portable application programming interface (API) to system services. In Python 2.5, the OS module contains over 200 methods, and many of those methods deal with data. In this section, we will go over many of the methods in that module that systems administrators care about when dealing with data.

Whenever you find yourself needing to explore a new module, IPython is often the right tool for the job, so let’s start our journey through the OS module using IPython to execute a sequence of actions that are fairly commonly encountered. Example 6-1 shows you how to do that.

Example 6-1. Exploring common OS module data methods
In [1]: import os

In [2]: os.getcwd()
Out[2]: '/private/tmp'

In [3]: os.mkdir("/tmp/os_mod_explore")

In [4]: os.listdir("/tmp/os_mod_explore")
Out[4]: []

In [5]: os.mkdir("/tmp/os_mod_explore/test_dir1")

In [6]: os.listdir("/tmp/os_mod_explore")
Out[6]: ['test_dir1']

In [7]: os.stat("/tmp/os_mod_explore")
Out[7]: (16877, 6029306L, 234881026L, 3, 501, 0, 102L, 
1207014425, 1207014398, 1207014398)

In [8]: os.rename("/tmp/os_mod_explore/test_dir1", 
"/tmp/os_mod_explore/test_dir1_renamed")

In [9]: os.listdir("/tmp/os_mod_explore")
Out[9]: ['test_dir1_renamed']

In [10]: os.rmdir("/tmp/os_mod_explore/test_dir1_renamed")

In [11]: os.rmdir("/tmp/os_mod_explore/")

As you can see, after we imported the OS module, in line [2] we get the current working directory, then proceed to make a directory in line [3]. We then use os.listdir in line [4] to list the contents of our newly created directory. Next, we do an os.stat, which is very similar to the stat command in Bash, and then rename a directory in line [8]. In line [9], we verify that the directory was renamed, and then we proceed to delete what we created by using the os.rmdir method.

This is by no means an exhaustive look at the OS module. There are methods to do just about anything you would need to do to the data, including changing permissions and creating symbolic links. Please refer to the documentation for the version of Python you are using, or alternately, use IPython with tab completion to view the available methods for the OS module.
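
For instance, changing permissions and creating a symbolic link are each a single call. Here is a minimal sketch using throwaway paths under /tmp rather than anything from the session above:

import os

os.mkdir("/tmp/perm_test")
# give the owner full access, everyone else read and execute
os.chmod("/tmp/perm_test", 0755)
# create a symbolic link that points at the new directory
os.symlink("/tmp/perm_test", "/tmp/perm_test_link")
# clean up the link first, then the directory itself
os.remove("/tmp/perm_test_link")
os.rmdir("/tmp/perm_test")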

Copying, Moving, Renaming, and Deleting Data

Since we talked about data wrangling in the introduction, and you now also have a bit of an idea about how to use the OS module, we can jump right into a higher-level module, called shutil that deals with data on a larger scale. The shutil module has methods for copying, moving, renaming, and deleting data just as the OS module does, but it can perform actions on an entire data tree.

Exploring the shutil module with IPython is a fun way to get acquainted with it. In the example below, we will be using shutil.copytree, but shutil has many other methods that do slightly different things. Please refer to the Python Standard Library documentation to see the differences between shutil copy methods. See Example 6-2.

Example 6-2. Using the shutil module to copy a data tree
In [1]: import os
            
In [2]: os.chdir("/tmp")
In [3]: os.makedirs("test/test_subdir1/test_subdir2")

In [4]: ls -lR
total 0
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test/

./test:
total 0
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test_subdir1/

./test/test_subdir1:
total 0
drwxr-xr-x  2 ngift  wheel  68 Mar 31 22:27 test_subdir2/

./test/test_subdir1/test_subdir2:            

            
In [5]: import shutil

In [6]: shutil.copytree("test", "test-copy")

In [19]: ls -lR
total 0
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test/
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test-copy/

./test:
total 0
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test_subdir1/

./test/test_subdir1:
total 0
drwxr-xr-x  2 ngift  wheel  68 Mar 31 22:27 test_subdir2/

./test/test_subdir1/test_subdir2:

./test-copy:
total 0
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test_subdir1/

./test-copy/test_subdir1:
total 0
drwxr-xr-x  2 ngift  wheel  68 Mar 31 22:27 test_subdir2/

./test-copy/test_subdir1/test_subdir2:

Obviously, this is quite simple, yet incredibly useful, as you can quite easily wrap this type of code into a more sophisticated cross-platform data mover script. The immediate use for this kind of code sequence that pops into our heads is to move data from one filesystem to another on an event. In an animation environment, it is often necessary to wait for the latest frames to be finished before converting them into a sequence to edit.

We could write a script to watch a directory for “x” number of frames in a cron job. When that cron job sees that the correct number of frames has been reached, it could then migrate that directory into another directory where the frames could be processed, or even just moved so that they are on a faster disk with I/O quick enough to handle playback of uncompressed HD footage.
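
Here is a minimal sketch of what that cron-driven check might look like. The paths, the .dpx extension, and the frame count threshold are all hypothetical placeholders; the point is simply to show how little code the event-driven migration requires:

import os
import shutil

WATCH_DIR = "/net/cheap_storage/incoming_frames"    #hypothetical slow storage
FAST_DIR = "/net/fast_storage/ready_for_playback"   #hypothetical fast storage
FRAME_COUNT = 240                                    #arbitrary threshold

def migrate_when_ready():
    """Move the whole directory to fast storage once enough frames have landed."""
    frames = [f for f in os.listdir(WATCH_DIR) if f.endswith(".dpx")]
    if len(frames) >= FRAME_COUNT:
        shutil.move(WATCH_DIR, FAST_DIR)

if __name__ == "__main__":
    migrate_when_ready()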

The shutil module doesn’t just copy files, though; it also has methods for moving and deleting trees of data. Example 6-3 shows a move of our tree, and Example 6-4 shows how to delete it.

Example 6-3. Moving a data tree with shutil
In [20]: shutil.move("test-copy", "test-copy-moved")
    
In [21]: ls -lR
total 0
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test/
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test-copy-moved/
    
./test:
total 0
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test_subdir1/
    
./test/test_subdir1:
total 0
drwxr-xr-x  2 ngift  wheel  68 Mar 31 22:27 test_subdir2/
    
./test/test_subdir1/test_subdir2:
    
./test-copy-moved:
total 0
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test_subdir1/
    
./test-copy-moved/test_subdir1:
total 0
drwxr-xr-x  2 ngift  wheel  68 Mar 31 22:27 test_subdir2/
    
./test-copy-moved/test_subdir1/test_subdir2:
Example 6-4. Deleting a data tree with shutil
In [22]: shutil.rmtree("test-copy-moved")

In [23]: shutil.rmtree("test-copy")

In [24]: ll

Moving a data tree is a bit more exciting than deleting a data tree, as there is nothing to show after a delete. Many of these simple examples could be combined with other actions in more sophisticated scripts. One useful script you could write would be a backup tool that copies a directory tree to cheap network storage and then creates a date-stamped archive. Fortunately, we have an example of doing just that in pure Python in the backup section of this chapter.

Working with Paths, Directories, and Files

One can’t talk about dealing with data without taking into account paths, directories, and files. Every sysadmin needs to be able to, at the very least, write a tool that walks a directory, searches for a condition, and then does something with the result. We are going to cover some interesting ways to do just that.

As always, the Standard Library in Python has some killer tools to get the job done. Python doesn’t have a reputation for being “batteries included” for nothing. Example 6-5 shows how to create an extra verbose directory walking script with functions that explicitly return files, directories, and paths.

Example 6-5. Verbose directory walking script
import os
path = "/tmp"

def enumeratepaths(path=path):
    """Returns the path to all the files in a directory recursively"""
    path_collection = []
    for dirpath, dirnames, filenames in os.walk(path):
        for file in filenames:
            fullpath = os.path.join(dirpath, file)
            path_collection.append(fullpath)

    return path_collection

def enumeratefiles(path=path):
    """Returns all the files in a directory as a list"""
    file_collection = []
    for dirpath, dirnames, filenames in os.walk(path):
        for file in filenames:
            file_collection.append(file)

    return file_collection

def enumeratedir(path=path):
    """Returns all the directories in a directory as a list"""
    dir_collection = []
    for dirpath, dirnames, filenames in os.walk(path):
        for dir in dirnames:
            dir_collection.append(dir)

    return dir_collection

if __name__ == "__main__":
    print "\nRecursive listing of all paths in a dir:"
    for path in enumeratepaths():
        print path
    print "\nRecursive listing of all files in dir:"
    for file in enumeratefiles():
        print file
    print "\nRecursive listing of all dirs in dir:"
    for dir in enumeratedir():
        print dir

On a Mac laptop, the output of this script looks like this:

[ngift@Macintosh-7][H:12022][J:0]# python enumarate_file_dir_path.py
    
Recursive listing of all paths in a dir:
/tmp/.aksusb
/tmp/ARD_ABJMMRT
/tmp/com.hp.launchport
/tmp/error.txt
/tmp/liten.py
/tmp/LitenDeplicationReport.csv
/tmp/ngift.liten.log
/tmp/hsperfdata_ngift/58920
/tmp/launch-h36okI/Render
/tmp/launch-qy1S9C/Listeners
/tmp/launch-RTJzTw/:0
/tmp/launchd-150.wDvODl/sock
    
Recursive listing of all files in dir:
.aksusb
ARD_ABJMMRT
com.hp.launchport
error.txt
liten.py
LitenDeplicationReport.csv
ngift.liten.log
58920
Render
Listeners
:0
sock
    
Recursive listing of all dirs in dir:
.X11-unix
hsperfdata_ngift
launch-h36okI
launch-qy1S9C
launch-RTJzTw
launchd-150.wDvODl
ssh-YcE2t6PfnO

A note about the previous code snippet: os.walk returns a generator object, so if you hold onto the value that os.walk gives you, you can walk a tree yourself:

In [2]: import os
     
In [3]: os.walk("/tmp")
Out[3]: <generator object at 0x508e18>

This is what it looks like when it is run from IPython. You will notice that using a generator gives us the ability to call path.next(). We won’t get into the nitty-gritty details about generators here, but it is important to know that os.walk returns a generator object. Generators are tremendously useful for systems programming. Visit David Beazley’s website (http://www.dabeaz.com/generators/) to find out all you need to know about them.

In [2]: import os

In [3]: os.walk("/tmp")
Out[3]: <generator object at 0x508e18>

In [4]: path = os.walk("/tmp")

In [5]: path.
path.__class__         path.__init__          path.__repr__          path.gi_running
path.__delattr__       path.__iter__          path.__setattr__       path.next
path.__doc__           path.__new__           path.__str__           path.send
path.__getattribute__  path.__reduce__        path.close             path.throw
path.__hash__          path.__reduce_ex__     path.gi_frame

In [5]: path.next()
Out[5]:
('/tmp',
 ['.X11-unix',
  'hsperfdata_ngift',
  'launch-h36okI',
  'launch-qy1S9C',
  'launch-RTJzTw',
  'launchd-150.wDvODl',
  'ssh-YcE2t6PfnO'],
 ['.aksusb',
  'ARD_ABJMMRT',
  'com.hp.launchport',
  'error.txt',
  'liten.py',
  'LitenDeplicationReport.csv',
  'ngift.liten.log'])

In a bit, we will look at generators in more detail, but let’s first make a cleaner module that gives us files, directories, and paths in a clean API.

Now that we have walked a very basic directory, let’s make this an object-oriented module so that we can easily import and reuse it again. It takes only a small amount of work to turn the hardcoded script into a generic module that we can reuse later, and that will certainly make our lives easier. See Example 6-6.

Example 6-6. Creating reusable directory walking module
import os

class diskwalk(object):
    """API for getting directory walking collections"""
    def __init__(self, path):
        self.path = path

    def enumeratePaths(self):
        """Returns the path to all the files in a directory as a list"""
        path_collection = []
        for dirpath, dirnames, filenames in os.walk(self.path):
            for file in filenames:
                fullpath = os.path.join(dirpath, file)
                path_collection.append(fullpath)

        return path_collection

    def enumerateFiles(self):
        """Returns all the files in a directory as a list"""
        file_collection = []
        for dirpath, dirnames, filenames in os.walk(self.path):
            for file in filenames:
                file_collection.append(file)

        return file_collection

    def enumerateDir(self):
        """Returns all the directories in a directory as a list"""
        dir_collection = []
        for dirpath, dirnames, filenames in os.walk(self.path):
            for dir in dirnames:
                dir_collection.append(dir)

        return dir_collection

As you can see, with a few small modifications, we were able to make a very nice interface for future modifications. One of the nice things about this new module is that we can import it into another script.
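
For example, assuming the module was saved as diskwalk_api.py, a quick sketch of using it from another script looks like this:

from diskwalk_api import diskwalk

d = diskwalk("/tmp")
#print the full path of every file found under /tmp
for path in d.enumeratePaths():
    print path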

Comparing Data

Comparing data is quite important to a sysadmin. Questions you might often ask yourself are, “What files are different between these two directories? How many copies of this same file exist on my system?” In this section, you will find the ways to answer those questions and more.

When dealing with massive quantities of important data, it often is necessary to compare directory trees and files to see what changes have been made. This becomes critical if you start writing large data mover scripts. The absolute doomsday scenario is to write a large data move script that damages critical production data.

In this section, we will first explore a few lightweight methods to compare files and directories and then move on to eventually exploring doing checksum comparisons of files. The Python Standard Library has several modules that assist with comparisons and we will be covering filecmp and os.listdir.

Using the filecmp Module

The filecmp module contains functions for doing fast and efficient comparisons of files and directories. By default, filecmp.cmp performs a “shallow” comparison: it looks at the os.stat signatures (file type, size, and modification time) of the two files and returns True if they match, falling back to comparing the actual contents only when the signatures differ. If you pass shallow=False, the contents are always compared byte for byte.

In order to fully understand how filecmp works, we need to create three files from scratch. To do this on your computer, change into the /tmp directory, make a file called file0.txt, and place a “0” in the file. Next, create a file called file1.txt, and place a “1” in that file. Finally, create a file called file00.txt, and place a “0” in it. We will use these files as examples in the following code:

In [1]: import filecmp   
              
In [2]: filecmp.cmp("file0.txt", "file1.txt")
Out[2]: False
                 
In [3]: filecmp.cmp("file0.txt", "file00.txt")
Out[3]: True

As you can see, the cmp function returned True in the case of file0.txt and file00.txt, and False when file1.txt was compared with file0.txt.

The dircmp function has a number of attributes that report differences between directory trees. We won’t go over every attribute, but we have created a few examples of useful things you can do. For this example, we created two subdirectories in the /tmp directory and copied the files from our previous example into each directory. In dirB, we created one extra file named file11.txt, into which we put “11”:

In [1]: import filecmp
                
In [2]: pwd
Out[2]: '/private/tmp'
                
In [3]: filecmp.dircmp("dirA", "dirB").diff_files
Out[3]: []
                
In [4]: filecmp.dircmp("dirA", "dirB").same_files
Out[4]: ['file1.txt', 'file00.txt', 'file0.txt']
                
In [5]: filecmp.dircmp("dirA", "dirB").report()
diff dirA dirB
Only in dirB : ['file11.txt']
Identical files : ['file0.txt', 'file00.txt', 'file1.txt']

You might be a bit surprised to see here that there were no matches for diff_files even though we created a file11.txt that has unique information in it. The reason is that diff_files compares only the differences between files with the same name.

Next, look at the output of same_files, and notice that it only reports back files that are identical in two directories. Finally, we can generate a report as shown in the last example. It has a handy output that includes a breakdown of the differences between the two directories. This brief overview is just a bit of what the filecmp module can do, so we recommend taking a look at the Python Standard Library documentation to get a full overview of the features we did not have space to cover.

Using os.listdir

Another lightweight method of comparing directories is to use os.listdir. You can think of os.listdir as an ls command that returns a Python list of the files found. Because Python supports many interesting ways to deal with lists, you can use os.listdir to determine differences in a directory yourself, quite simply by converting your list into a set and then subtracting one set from another. Here is an example of what this looks like in IPython:

In [1]: import os
          
In [2]: dirA = set(os.listdir("/tmp/dirA"))
            
In [3]: dirA
Out[3]: set(['file1.txt', 'file00.txt', 'file0.txt'])
            
In [4]: dirB = set(os.listdir("/tmp/dirB"))
            
In [5]: dirB
Out[5]: set(['file1.txt', 'file00.txt', 'file11.txt', 'file0.txt'])
            
In [6]: dirA - dirB
Out[6]: set([])
            
In [7]: dirB-dirA
Out[7]: set(['file11.txt'])

From this example, you can see that we used a neat trick of converting two lists into sets and then subtracting the sets to find the differences. Notice that line [7] returns file11.txt because dirB is a superset of dirA, but in line [6] the result is empty because every item in dirA is also present in dirB. Using sets makes it easy to create a simple merge of two data structures as well, by subtracting the full paths of one directory against another and then copying the difference. We will discuss merging data in the next section.

This approach has some very large limitations, though. The actual name of a file is often misleading, as it is possible to have a 0 KB file that has the same name as a file holding 200 GB of data. In the next section, we cover a better approach to finding the differences between two directories and merging the contents together.

Merging Data

What can you do when you don’t want to simply compare data files, but you would like to merge two directory trees together? A problem often can occur when you want to merge the contents of one tree into another without creating any duplicates.

You could just blindly copy the files from one directory into your target directory, and then deduplicate the directory, but it would be more efficient to prevent the duplicates in the first place. One reasonably simple solution would be to use the filecmp module’s dircmp function to compare two directories, and then copy the unique results using the os.listdir technique described earlier. A better choice would be to use MD5 checksums, which we explain in the next section.
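
Here is a minimal sketch of that first approach: dircmp’s left_only attribute lists the top-level entries that exist only in the first directory, and we copy those files into the target. Keep in mind that, like the os.listdir technique, this compares entries by name only and does not descend into subdirectories:

import os
import shutil
from filecmp import dircmp

def merge_unique(source, target):
    """Copy top-level files that exist only in source into target."""
    comparison = dircmp(source, target)
    for name in comparison.left_only:
        fullpath = os.path.join(source, name)
        if os.path.isfile(fullpath):
            shutil.copy2(fullpath, target)

merge_unique("/tmp/dirA", "/tmp/dirB")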

MD5 Checksum Comparisons

Performing an MD5 checksum on a file and comparing it to another file is like going target shooting with a bazooka. It is the big weapon you pull out when you want to be very sure of what you are doing, although only a byte-by-byte comparison is truly 100 percent accurate. Example 6-7 shows how the function takes in a path to a file and returns a checksum.

Example 6-7. Performing an MD5 checksum on files
import hashlib

def create_checksum(path):
    """
    Reads in file. Creates checksum of file in 8 KB chunks.
    Returns complete checksum total for file.
    """
    fp = open(path, "rb")
    checksum = hashlib.md5()
    while True:
        buffer = fp.read(8192)
        if not buffer:
            break
        checksum.update(buffer)
    fp.close()
    checksum = checksum.digest()
    return checksum

Here is an interactive example that uses this function with IPython to compare two files:

In [2]: from checksum import create_checksum

In [3]: if create_checksum("image1") == create_checksum("image2"):
   ...:     print "True"
   ...:
   ...:
True

In [5]: if create_checksum("image1") == create_checksum("image_unique"):
   ...:     print "True"
   ...:
   ...:
In that example, the checksums of the files were manually compared, but we can use the code we wrote earlier that returns a list of paths to recursively compare a directory tree full of files and give us the duplicates. One of the other nice things about creating a reasonable API is that we can now use IPython to interactively test our solution. Then, if it works, we can create another module. Example 6-8 shows the code for finding the duplicates.

Example 6-8. Performing an MD5 checksum on a directory tree to find duplicates
In [1]: from checksum import create_checksum

In [2]: from diskwalk_api import diskwalk

In [3]: from os.path import getsize

In [4]: d = diskwalk('/tmp/duplicates_directory')

In [5]: files = d.enumeratePaths()

In [6]: len(files)
Out[6]: 12

In [7]: dup = []

In [8]: record = {}

In [9]: for file in files:
   ...:     compound_key = (getsize(file), create_checksum(file))
   ...:     if compound_key in record:
   ...:         dup.append(file)
   ...:     else:
   ...:         record[compound_key] = file
   ...:
   ...:

In [10]: print dup
['/tmp/duplicates_directory/image2']

The only portion of this code that we haven’t looked at in previous examples is found on lines [8] and [9]. We create an empty dictionary and then use a key built from each file’s size and checksum to record whether we have seen that content before. If we have, then we toss the file into a dup list. Now, let’s separate this into a piece of code we can use again. After all, that is quite useful. Example 6-9 shows how to do that.

Example 6-9. Finding duplicates
from checksum import create_checksum
from diskwalk_api import diskwalk
from os.path import getsize

def findDupes(path='/tmp'):
    dup = []
    record = {}
    d = diskwalk(path)
    files = d.enumeratePaths()
    for file in files:
        compound_key = (getsize(file), create_checksum(file))
        if compound_key in record:
            dup.append(file)
        else:
            #print "Creating compound key record:", compound_key
            record[compound_key] = file
    return dup

if __name__ == "__main__":
    dupes = findDupes()
    for dup in dupes:
        print "Duplicate: %s" % dup

When we run that script, we get the following output:

[ngift@Macintosh-7][H:10157][J:0]# python find_dupes.py
Duplicate: /tmp/duplicates_directory/image2

We hope you can see that this shows what even a little bit of code reuse can accomplish. We now have a generic module that will take a directory tree and return a list of all the duplicate files. This is quite handy in and of itself, but next we can take this one step further and automatically delete the duplicates.

Deleting files in Python is simple, as you can use os.remove(file). In this example, we have a number of 10 MB files in our /tmp directory; let’s try to delete one of them using os.remove(file):

In [1]: import os
          
In [2]: os.remove("10
10mbfile.0  10mbfile.1  10mbfile.2  10mbfile.3  10mbfile.4  
10mbfile.5  10mbfile.6  10mbfile.7  10mbfile.8
    
In [2]: os.remove("10mbfile.1")

In [3]: os.remove("10
10mbfile.0  10mbfile.2  10mbfile.3  10mbfile.4  10mbfile.5  
10mbfile.6  10mbfile.7  10mbfile.8

Notice that tab completion in IPython allows us to see the matches and fills out the names of the files for us. Be aware that the os.remove(file) method is silent and permanent, so this might or might not be what you want to do. With that in mind, we can implement an easy method to delete our duplicates, and then enhance it after the fact. Because it is so easy to test interactive code with IPython, we are going to write a test function on the fly and try it:

In [1]: from find_dupes import findDupes

In [2]: dupes = findDupes("/tmp")

In [3]: def delete(file):
   ...:     import os
   ...:     print "deleting %s" % file
   ...:     os.remove(file)
   ...:
   ...:

In [4]: for dupe in dupes:
   ...:     delete(dupe)
   ...:
   ...:
deleting /tmp/10mbfile.2
deleting /tmp/10mbfile.3
deleting /tmp/10mbfile.4
deleting /tmp/10mbfile.5
deleting /tmp/10mbfile.6
deleting /tmp/10mbfile.7
deleting /tmp/10mbfile.8

In this example, we added some complexity to our delete function by including a print statement for each file we automatically delete. Just because we created a whole slew of reusable code doesn’t mean we need to stop now. We can create another module that does fancy delete-related things with file objects. The module doesn’t even need to be tied to duplicates; it can be used to delete anything. See Example 6-10.

Example 6-10. Delete module
#!/usr/bin/env python
import os

class Delete(object):
    """Delete Methods For File Objects"""

    def __init__(self, file):
        self.file = file

    def interactive(self):
        """interactive deletion mode"""
        input = raw_input("Do you really want to delete %s [N]/Y" % self.file)
        if input.upper() == "Y":
            print "DELETING:  %s" % self.file
            status = os.remove(self.file)
        else:
            print "Skipping:  %s" % self.file
        return

    def dryrun(self):
        """simulation mode for deletion"""
        print "Dry Run:  %s [NOT DELETED]" % self.file
        return

    def delete(self):
        """Performs a delete on a file, with additional conditions"""
        print "DELETING:  %s" % self.file
        try:
            status = os.remove(self.file)
        except Exception, err:
            print err

if __name__ == "__main__":
    from find_dupes import findDupes
    dupes = findDupes('/tmp')
    for dupe in dupes:
        delete = Delete(dupe)
        #delete.dryrun()
        #delete.delete()
        #delete.interactive()

In this module, you will see three different types of deletes. The interactive delete method prompts the user to confirm each file before it is deleted. This can seem a bit annoying, but it is good protection when other programmers will be maintaining and updating the code.

The dry run method simulates a deletion. And, finally, there is an actual delete method that will permanently delete your files. At the bottom of the module, you can see that there is a commented example of the ways to use each of these three different methods. Here is an example of each method in action:

  • Dry run

    ngift@Macintosh-7][H:10197][J:0]# python delete.py
    Dry Run:  /tmp/10mbfile.1 [NOT DELETED]
    Dry Run:  /tmp/10mbfile.2 [NOT DELETED]
    Dry Run:  /tmp/10mbfile.3 [NOT DELETED]
    Dry Run:  /tmp/10mbfile.4 [NOT DELETED]
    Dry Run:  /tmp/10mbfile.5 [NOT DELETED]
    Dry Run:  /tmp/10mbfile.6 [NOT DELETED]
    Dry Run:  /tmp/10mbfile.7 [NOT DELETED]
    Dry Run:  /tmp/10mbfile.8 [NOT DELETED]
  • Interactive

    ngift@Macintosh-7][H:10201][J:0]# python delete.py
    Do you really want to delete /tmp/10mbfile.1 [N]/YY
    DELETING:  /tmp/10mbfile.1
    Do you really want to delete /tmp/10mbfile.2 [N]/Y
    Skipping:  /tmp/10mbfile.2
    Do you really want to delete /tmp/10mbfile.3 [N]/Y
  • Delete

    [ngift@Macintosh-7][H:10203][J:0]# python delete.py      
    DELETING:  /tmp/10mbfile.1
    DELETING:  /tmp/10mbfile.2
    DELETING:  /tmp/10mbfile.3
    DELETING:  /tmp/10mbfile.4
    DELETING:  /tmp/10mbfile.5
    DELETING:  /tmp/10mbfile.6
    DELETING:  /tmp/10mbfile.7
    DELETING:  /tmp/10mbfile.8

You might find using encapsulation techniques like this very handy when dealing with data because you can prevent a future problem by abstracting what you are working on enough to make it nonspecific to your problem. In this situation, we wanted to automatically delete duplicate files, so we created a module that generically finds filenames and deletes them. We could make another tool that generically takes file objects and applies some form of compression as well. We are actually going to get to that example in just a bit.

Pattern Matching Files and Directories

So far you have seen how to process directories and files, and perform actions such as finding duplicates, deleting directories, moving directories, and so on. The next step in mastering the directory tree is to use pattern matching, either alone or in combination with these previous techniques. As with just about everything else in Python, performing a pattern match for a file extension or filename is simple. In this section, we will demonstrate a few common pattern matching problems and apply the techniques used earlier to create simple, yet powerful reusable tools.

A fairly common problem sysadmins need to solve is that they need to track down and delete, move, rename, or copy a certain file type. The most straightforward approach to doing this in Python is to use either the fnmatch module or the glob module. The main difference between these two modules is that fnmatch returns a True or False for a Unix wildcard, and glob returns a list of pathnames that match a pattern. Alternatively, regular expressions can be used to create more sophisticated pattern matching tools. Please refer to Chapter 3 to get more detailed instructions on using regular expressions to match patterns.
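
As a small taste of the regular expression route, here is a sketch that matches filenames ending in .mp3 regardless of case, something a plain fnmatch pattern will not do on a case-sensitive filesystem:

import os
import re

#case-insensitive match for a .mp3 extension at the end of a filename
mp3_pattern = re.compile(r"\.mp3$", re.IGNORECASE)

for name in os.listdir("/tmp"):
    if mp3_pattern.search(name):
        print name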

Example 6-11 will look at how fnmatch and glob can be used. We will reuse the code we’ve been working on by importing diskwalk from the diskwalk_api module.

Example 6-11. Interactively using fnmatch and glob to search for file matches
In [1]: from diskwalk_api import diskwalk

In [2]: files = diskwalk("/tmp").enumeratePaths()

In [3]: from fnmatch import fnmatch

In [4]: for file in files:
   ...:     if fnmatch(file,"*.txt"):
   ...:         print file
   ...:         
   ...:         
/tmp/file.txt

In [5]: from glob import glob

In [6]: import os

In [7]: os.chdir("/tmp")

In [8]: glob("*")
Out[8]: ['file.txt', 'image.iso', 'music.mp3']

In the previous example, after we reused our previous diskwalk module, we received a list of all of the full paths located in the /tmp directory. We then used fnmatch to determine whether each file matched the pattern “*.txt”. The glob module is a bit different, in that it will literally “glob,” or match a pattern, and return the full path. Glob is a much higher-level function than fnmatch, but both are very useful tools for slightly different jobs.

The fnmatch function is particularly useful when it is combined with other code to create a filter to search for data in a directory tree. Often, when dealing with directory trees, you will want to work with files that match certain patterns. To see this in action, we will solve a classic sysadmin problem by renaming all of the files that match a pattern in a directory tree. Keep in mind that it is just as simple to rename files as it is to delete, compress, or process them. There is a simple pattern here:

  1. Get the path to a file in a directory.

  2. Perform some optional layer of filtering; this could involve many filters, such as filename, extension, size, uniqueness, and so on.

  3. Perform an action on them; this could be copying, deleting, compressing, reading, and so on. Example 6-12 shows how to do this.

Example 6-12. Renaming a tree full of MP3 files to text files
In [1]: from diskwalk_api import diskwalk

In [2]: from shutil import move

In [3]: from fnmatch import fnmatch

In [4]: files = diskwalk("/tmp").enumeratePaths()

In [5]: for file in files:
   ...:     if fnmatch(file, "*.mp3"):
   ...:         #here we can do anything we want: delete, move, rename...hmmm, rename
   ...:         move(file, "%s.txt" % file)
   ...:
In [6]: ls -l /tmp/
total 0
-rw-r--r--  1 ngift  wheel  0 Apr  1 21:50 file.txt
-rw-r--r--  1 ngift  wheel  0 Apr  1 21:50 image.iso
-rw-r--r--  1 ngift  wheel  0 Apr  1 21:50 music.mp3.txt
-rw-r--r--  1 ngift  wheel  0 Apr  1 22:45 music1.mp3.txt
-rw-r--r--  1 ngift  wheel  0 Apr  1 22:45 music2.mp3.txt
-rw-r--r--  1 ngift  wheel  0 Apr  1 22:45 music3.mp3.txt

Using code we already wrote, we used four lines of very readable Python code to rename a tree full of MP3 files to text files. If you are one of the few sysadmins who has not read at least one episode of BOFH, or Bastard Operator From Hell, it might not be immediately obvious what we could do next with our bit of code.

Imagine you have a production file server that is strictly for high-performance disk I/O storage, and it has a limited capacity. You have noticed that it often gets full because one or two abusers place hundreds of GBs of MP3 files on it. You could put a quota on the amount of file space each user can access, of course, but often that is more trouble than it is worth. One solution would be to create a cron job every night that finds these MP3 files, and does “random” things to them. On Monday it could rename them to text files, on Tuesday it could compress them into ZIP files, on Wednesday it could move them all into the /tmp directory, and on Thursday it could delete them, and send the owner of the file an emailed list of all the MP3 files it deleted. We would not suggest doing this unless you own the company you work for, but for the right BOFH, the earlier code example is a dream come true.

Wrapping Up rsync

As you might well already know, rsync is a command-line tool that was originally written by Andrew Tridgell and Paul Mackerras. Late in 2007, rsync version 3 was released for testing and it includes an even greater assortment of options than the original version.

Over the years, we have found ourselves using rsync as the primary tool to move data from point A to point B. The manpage is staggering in the number of options it describes, so we recommend that you read through it in detail. Rsync may just be the single most useful command-line tool ever written for systems administrators.

With that being said, there are some ways that Python can help control, or glue rsync’s behavior. One problem that we have encountered is ensuring that data gets copied at a scheduled time. We have been in many situations in which we needed to synchronize TBs of data as quickly as possible between one file server and another, but we did not want to monitor the process manually. This is a situation in which Python really makes a lot of sense.

With Python you can add a degree of artificial intelligence to rsync and customize it to your particular needs. The point of using Python as glue code is that you make Unix utilities do things they were never intended to do, and so you make highly flexible and customizable tools. The limit is truly only your imagination. Example 6-13 shows a very simple example of how to wrap rsync.

Example 6-13. Simple wrap of rsync
#!/usr/bin/env python
#wraps up rsync to synchronize two directories
from subprocess import call
import sys

source = "/tmp/sync_dir_A/" #Note the trailing slash
target = "/tmp/sync_dir_B"
rsync = "rsync"
arguments = "-a"
cmd = "%s %s %s %s" % (rsync, arguments, source, target)


def sync():
    ret = call(cmd, shell=True)
    if ret != 0:
        print "rsync failed"
        sys.exit(1)

sync()

This example is hardcoded to synchronize two directories and to print out a failure message if the command does not work. We could do something a bit more interesting, though, and solve a problem that we have frequently run into. We have often found that we are called upon to synchronize two very large directories, and we don’t want to monitor data synchronization overnight. But if you don’t monitor the synchronization, you can find that it was disrupted partway through the process, and quite often the data, along with a whole night, is wasted, and the process needs to start again the next day. Using Python, you can create a more aggressive, highly motivated rsync command.

What would a highly motivated rsync command do exactly? Well, it would do what you would do if you were monitoring the synchronization of two directories: it would continue trying to synchronize the directories until it finished, and then it would send an email saying it was done. Example 6-14 shows the rsync code of our little overachiever in action.

Example 6-14. An rsync command that doesn’t quit until the job is finished
#!/usr/bin/env python
#wraps up rsync to synchronize two directories
from subprocess import call
import sys
import time

"""this motivated rsync tries to synchronize forever"""

source = "/tmp/sync_dir_A/" #Note the trailing slash
target = "/tmp/sync_dir_B"
rsync = "rsync"
arguments = "-av"
cmd = "%s %s %s %s" % (rsync, arguments, source, target)

def sync():
    while True:
        ret = call(cmd, shell=True)
        if ret != 0:
            print "resubmitting rsync"
            time.sleep(30)
        else:
            print "rsync was successful"
            call("mail -s 'jobs done' [email protected]", shell=True)
            sys.exit(0)

sync()
                                      

This is overly simplified and contains hardcoded data, but it is an example of the kind of useful tool you can develop to automate something you normally need to monitor manually. There are some other features you can include, such as the ability to set the retry interval and limit as well as the ability to check for disk usage on the machine to which you are connecting and so on.
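
As a sketch of that first idea, here is one way the sync function from Example 6-14 might grow a configurable retry interval and retry limit; the particular numbers are arbitrary:

#!/usr/bin/env python
#a sketch of Example 6-14 with a retry interval and a retry limit
from subprocess import call
import sys
import time

source = "/tmp/sync_dir_A/" #Note the trailing slash
target = "/tmp/sync_dir_B"
cmd = "rsync -av %s %s" % (source, target)

def sync(retry_interval=30, max_retries=100):
    """Retry rsync until it succeeds or the retry limit is reached."""
    for attempt in range(max_retries):
        ret = call(cmd, shell=True)
        if ret == 0:
            print "rsync was successful"
            return 0
        print "resubmitting rsync (attempt %d)" % (attempt + 1)
        time.sleep(retry_interval)
    print "giving up after %d attempts" % max_retries
    return 1

if __name__ == "__main__":
    sys.exit(sync())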

Metadata: Data About Data

Most systems administrators get to the point where they start to be concerned, not just about data, but about the data about the data. Metadata, or data about data, can often be more important than the data itself. To give an example, in film and television, the same data often exists in multiple locations on a filesystem or even on several filesystems. Keeping track of the data often involves creating some type of metadata management system.

It is the data about how those files are organized and used, though, that can be the most critical to an application, to an animation pipeline, or to restore a backup. Python can help here, too, as it is easy to both use metadata and write metadata with Python.

Let’s look at using a popular ORM, SQLAlchemy, to create some metadata about a filesystem. Fortunately, the documentation for SQLAlchemy is very good, and SQLAlchemy works with SQLite. We think this is a killer combination for creating custom metadata solutions.

In the examples above, we walked a filesystem in real time and performed actions and queries on paths that we found. While this is incredibly useful, it is also time-consuming to search a large filesystem consisting of millions of files to do just one thing. In Example 6-15, we show what a very basic metadata system could look like by combining the directory walking techniques we have just mastered with an ORM.

Example 6-15. Creating metadata about a filesystem with SQLAlchemy
#!/usr/bin/env python
from sqlalchemy import create_engine
from sqlalchemy import Table, Column, Integer, String, MetaData, ForeignKey
from sqlalchemy.orm import mapper, sessionmaker
import os

#path
path = "/tmp"

#Part 1:  create engine
engine = create_engine('sqlite:///:memory:', echo=False)

#Part 2:  metadata
metadata = MetaData()

filesystem_table = Table('filesystem', metadata,
   Column('id', Integer, primary_key=True),
   Column('path', String(500)),
   Column('file', String(255)),
)

metadata.create_all(engine)

#Part 3:  mapped class
class Filesystem(object):

    def __init__(self, path, file):
        self.path = path
        self.file = file

    def __repr__(self):
        return "[Filesystem('%s','%s')]" % (self.path, self.file)

#Part 4:  mapper function
mapper(Filesystem,filesystem_table)

#Part 5:  create session
Session = sessionmaker(bind=engine, autoflush=True, transactional=True)
session = Session()

#Part 6:  crawl file system and populate database with results
for dirpath, dirnames, filenames in os.walk(path):
    for file in filenames:
        fullpath = os.path.join(dirpath, file)
        record = Filesystem(fullpath, file)
        session.save(record)

#Part 7:  commit to the database
session.commit()

#Part 8:  query
for record in session.query(Filesystem):
    print "Database Record Number: %s, Path: %s , File: %s " \
        % (record.id, record.path, record.file)

It would be best to think about this code as a set of procedures that are followed one after another. In part one, we create an engine, which is really just a fancy way of defining the database we are going to use. In part two, we define a metadata instance, and create our database tables. In part three, we create a class that will map to the tables in the database that we created. In part four, we call a mapper function that puts the ORM to work; it actually maps this class to the tables. In part five, we create a session to our database. Notice that there are a few keyword parameters that we set, including autoflush and transactional.

Now that we have the very explicit ORM setup completed, in part six, we do our usual song and dance, and grab the filenames and complete paths while we walk a directory tree. There are a couple of twists this time, though. Notice that we create a record in the database for each fullpath and file we encounter, and that we then save each newly created record as it is created. We then commit this transaction to our “in memory” SQLite database in part seven.

Finally, in part eight, we perform a query, in Python, of course, that returns the results of the records we placed in the database. This example could potentially be a fun way to experiment with creating custom SQLAlchemy metadata solutions for your company or clients. You could expand this code to do something interesting, such as perform relational queries or write results out to a file, and so on.
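
For instance, here is a small sketch that extends Example 6-15 by querying only the .txt files and writing the results out to a report file. It assumes the session and the mapped Filesystem class from the example are already set up, and it uses the same 0.4-era SQLAlchemy query API:

#only pull back records whose filename ends in .txt, then write a simple CSV report
report = open("/tmp/filesystem_report.csv", "w")
for record in session.query(Filesystem).filter(Filesystem.file.like('%.txt')):
    report.write("%s,%s\n" % (record.id, record.path))
report.close()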

Archiving, Compressing, Imaging, and Restoring

Dealing with data in big chunks is a problem that sysadmins have to face every day. They often use tar, dd, gzip, bzip, bzip2, hdiutil, asr, and other utilities to get their jobs done.

Believe it or not, the “batteries included” Python Standard Library has built-in support for TAR files, zlib files, and gzip files. If compression and archiving is your goal, then you will not have any problem with the rich tools Python has to offer. Let’s look at the granddaddy of all archive packages, tar, and see how the standard library implements it.

Using tarfile Module to Create TAR Archives

Creating a TAR archive is quite easy, almost too easy in fact. In Example 6-16, we first create a very large file to work with. Note that the syntax is much more user-friendly than that of the tar command itself.

Example 6-16. Create big text file
In [1]: f = open("largeFile.txt", "w")

In [2]: statement = "This is a big line that I intend to write over and over again."

In [3]: x = 0

In [4]: for x in xrange(20000):
   ....:     x += 1
   ....:     f.write("%s\n" % statement)
   ....:
   ....:

In [5]: ls -l
-rw-r--r-- 1 root root 1236992 Oct 25 23:13 largeFile.txt

OK, now that we have a big file full of junk, let’s TAR that baby up. See Example 6-17.

Example 6-17. TAR up contents of file
In [1]: import tarfile

In [2]: tar = tarfile.open("largefile.tar", "w")

In [3]: tar.add("largeFile.txt")

In [4]: tar.close()

In [5]: ll

-rw-r--r-- 1 root root 1236992 Oct 25 23:15 largeFile.txt
-rw-r--r-- 1 root root 1236992 Oct 26 00:39 largefile.tar

So, as you can see, this makes a vanilla TAR archive in a much easier syntax than the regular tar command. This certainly makes the case for using the IPython shell to do all of your daily systems administration work.

While it is handy to be able to create a TAR file using Python, it is almost useless to TAR up only one file. Using the same directory walking pattern we have used numerous times in this chapter, we can create a TAR file of the whole /tmp directory by walking the tree and then adding each file to the contents of the /tmp directory TAR. See Example 6-18.

Example 6-18. TAR up contents of a directory tree
In [27]: import tarfile
                
In [28]: tar = tarfile.open("temp.tar", "w")
                
In [29]: import os
                
In [30]: for root, dir, files in os.walk("/tmp"):
....:     for file in filenames:
....: 
KeyboardInterrupt
                
In [30]: for root, dir, files in os.walk("/tmp"):
   ....:     for file in files:
   ....:         fullpath = os.path.join(root, file)
   ....:         tar.add(fullpath)
   ....:
   ....:
                
In [33]: tar.close()

It is quite simple to add the contents of a directory tree by walking a directory, and it is a good pattern to use, because it can be combined with some of the other techniques we have covered in this chapter. Perhaps you are archiving a directory full of media files. It seems silly to archive exact duplicates, so perhaps you want to replace duplicates with symbolic links before you create a TAR file. With the information in this chapter, you can easily build the code that will do just that and save quite a bit of space.
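
Here is a minimal sketch of that idea, reusing the checksum and diskwalk modules built earlier in this chapter. The /tmp/media path and the archive name are just placeholders:

#!/usr/bin/env python
#replace duplicate files with symlinks to the first copy seen, then TAR the tree
import os
import tarfile
from os.path import getsize

from checksum import create_checksum
from diskwalk_api import diskwalk

def dedupe_and_archive(path, archive_name):
    record = {}
    for file in diskwalk(path).enumeratePaths():
        compound_key = (getsize(file), create_checksum(file))
        if compound_key in record:
            #replace the duplicate with a symlink to the first copy we saw
            os.remove(file)
            os.symlink(record[compound_key], file)
        else:
            record[compound_key] = file
    #symlinks are stored as symlinks by default, so the duplicates take no space
    tar = tarfile.open(archive_name, "w")
    tar.add(path)
    tar.close()

if __name__ == "__main__":
    dedupe_and_archive("/tmp/media", "/tmp/media.tar")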

Since doing a generic TAR archive is a little bit boring, let’s spice it up a bit and add bzip2 compression, which will make your CPU whine and complain at how much you are making it work. The bzip2 compression algorithm can do some really funky stuff. Let’s look at an example of how impressive it can truly be.

Then get real funky and make a 60 MB text file shrink down to 10 K! See Example 6-19.

Example 6-19. Creating bzip2 TAR archive
In [1]: tar = tarfile.open("largefilecompressed.tar.bzip2", "w|bz2")

In [2]: tar.add("largeFile.txt")

In [3]: ls -h
foo1.txt fooDir1/ largeFile.txt largefilecompressed.tar.bzip2*
foo2.txt fooDir2/ largefile.tar

In [4]: tar.close()

In [5]: ls -lh

-rw-r--r-- 1 root root 61M Oct 25 23:15 largeFile.txt
-rw-r--r-- 1 root root 61M Oct 26 00:39 largefile.tar
-rwxr-xr-x 1 root root 10K Oct 26 01:02 largefilecompressed.tar.bzip2*

What is amazing is that bzip2 was able to compress our 61 M text file into 10 K, although we did cheat quite a bit using the same data over and over again. This didn’t come at zero cost of course, as it took a few minutes to compress this file on a dual core AMD system.

Let’s go the whole nine yards and do a compressed archive with another of the available options: gzip. The syntax is only slightly different. See Example 6-20.

Example 6-20. Creating a gzip TAR archive
In [10]: tar = tarfile.open("largefile.tar.gzip", "w|gz")

In [11]: tar.add("largeFile.txt")

In [12]: tar.close()

In [13]: ls -lh

-rw-r--r-- 1 root root  61M Oct 26 01:20 largeFile.txt
-rw-r--r-- 1 root root  61M Oct 26 00:39 largefile.tar
-rwxr-xr-x 1 root root 160K Oct 26 01:24 largefile.tar.gzip*

A gzip archive is still incredibly small, coming in at 160 K, but on my machine it was able to create this compressed TAR file in seconds. This seems like a good trade-off in most situations.

Using the tarfile Module to Examine the Contents of TAR Files

Now that we have a tool that creates TAR files, it only makes sense to examine the TAR file’s contents. It is one thing to blindly create a TAR file, but if you have been a systems administrator for any length of time, you have probably gotten burned by a bad backup, or have been accused of making a bad backup.

To put this situation in perspective and highlight the importance of examining TAR archives, we will share a story about a fictional friend of ours; let’s call it The Case of the Missing TAR Archive. Names, identities, and facts are fictional; if this story resembles reality, it is completely coincidental.

Our friend worked at a major television studio as a systems administrator and was responsible for supporting a department led by a real crazy man. This man had a reputation for not telling the truth, acting impulsively, and, well, being crazy. If a situation arose where the crazy man was at fault, like he missed a deadline with a client, or didn’t produce a segment according to the specifications he was given, he would gladly just lie and blame it on someone else. Often times, that someone else was our friend, the systems administrator.

Unfortunately, our friend was responsible for maintaining this lunatic’s backups. His first thought was that it was time to look for a new job, but he had worked at this studio for many years, had many friends, and didn’t want to waste all that on this temporarily bad situation. He needed to make sure he covered all of his bases, so he instituted a logging system that categorized the contents of all of the automated TAR archives that were created for the crazy man, as he felt it was only a matter of time before he would get burned when the crazy man missed a deadline and needed an excuse.

One day our friend, William, got a call from his boss: “William, I need to see you in my office immediately; we have a situation with the backups.” William immediately walked over to his office and was told that the crazy man, Alex, had accused William of damaging the archive of his show, and this had caused him to miss a deadline with his client. When Alex missed deadlines with his client, it made Alex’s boss, Bob, very upset.

William was told by his boss that Alex had told him the backup contained nothing but empty, damaged files, and that he had been depending on that archive to work on his show. William then told his boss that he had been certain he would eventually be accused of messing up an archive, and had secretly written some Python code that inspected the contents of all the TAR archives he had made and created extended information about the attributes of the files before and after they were backed up. It turned out that Alex had never created a show to begin with, and that an empty folder had been archived for months.

When Alex was confronted with this information, he quickly backpedaled and looked for some way to shift attention onto a new issue. Unfortunately for Alex, this was the last straw, and a couple of months later he never showed up to work. He may have either left or been fired, but it didn’t matter; our friend William had solved The Case of the Missing TAR Archive.

The moral of this story is that when you are dealing with backups, treat them like nuclear weapons, as backups are fraught with danger in ways you might not even imagine.

Here are several methods to examine the contents of the TAR file we created earlier:

In [1]: import tarfile
             
In [2]: tar = tarfile.open("temp.tar","r")
             
In [3]: tar.list()
-rw-r--r-- ngift/wheel          2 2008-04-04 15:17:14 tmp/file00.txt
-rw-r--r-- ngift/wheel          2 2008-04-04 15:15:39 tmp/file1.txt
-rw-r--r-- ngift/wheel          0 2008-04-04 20:50:57 tmp/temp.tar
-rw-r--r-- ngift/wheel          2 2008-04-04 16:19:07 tmp/dirA/file0.txt
-rw-r--r-- ngift/wheel          2 2008-04-04 16:19:07 tmp/dirA/file00.txt
-rw-r--r-- ngift/wheel          2 2008-04-04 16:19:07 tmp/dirA/file1.txt
-rw-r--r-- ngift/wheel          2 2008-04-04 16:19:52 tmp/dirB/file0.txt
-rw-r--r-- ngift/wheel          2 2008-04-04 16:19:52 tmp/dirB/file00.txt
-rw-r--r-- ngift/wheel          2 2008-04-04 16:19:52 tmp/dirB/file1.txt
-rw-r--r-- ngift/wheel          3 2008-04-04 16:21:50 tmp/dirB/file11.txt
             
In [4]: tar.name
Out[4]: '/private/tmp/temp.tar'
             
In [5]: tar.getnames()
Out[5]: 
['tmp/file00.txt',
'tmp/file1.txt',
'tmp/temp.tar',
'tmp/dirA/file0.txt',
'tmp/dirA/file00.txt',
'tmp/dirA/file1.txt',
'tmp/dirB/file0.txt',
'tmp/dirB/file00.txt',
'tmp/dirB/file1.txt',
'tmp/dirB/file11.txt']
             
In [10]: tar.members
Out[10]: 
[<TarInfo 'tmp/file00.txt' at 0x109eff0>,
 <TarInfo 'tmp/file1.txt' at 0x109ef30>,
 <TarInfo 'tmp/temp.tar' at 0x10a4310>,
 <TarInfo 'tmp/dirA/file0.txt' at 0x10a4350>,
 <TarInfo 'tmp/dirA/file00.txt' at 0x10a43b0>,
 <TarInfo 'tmp/dirA/file1.txt' at 0x10a4410>,
 <TarInfo 'tmp/dirB/file0.txt' at 0x10a4470>,
 <TarInfo 'tmp/dirB/file00.txt' at 0x10a44d0>,
 <TarInfo 'tmp/dirB/file1.txt' at 0x10a4530>,
 <TarInfo 'tmp/dirB/file11.txt' at 0x10a4590>]

Those examples show how to examine the names of the files in the TAR archive, which could be validated in a data verification script. Extracting files is not much more work. If you want to extract a whole TAR archive to the current working directory, you can simply use the following:

In [60]: tar.extractall()
             
drwxrwxrwx  7 ngift  wheel    238 Apr  4 22:59 tmp/

If you are extremely paranoid, and you should be, then you could also include a step that extracts the contents of the archives and performs random MD5 checksums on files from the archive and compare them against MD5 checksums you made on the file before it was backed up. This could be a very effective way to monitor whether the integrity of the data is what you expect it to be.

No sane archiving solution should just trust that an archive was created properly. At the very least, random spot checking of archives needs to be done automatically. At best, every single archive should be reopened and checked for validity after it has been created.
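
Here is a minimal sketch of such a spot check. It assumes, as with the temp.tar archive above, that the archive stores absolute paths with the leading slash stripped, so the original file can be found again by putting the slash back; the sample size is arbitrary:

import hashlib
import random
import tarfile

def spot_check(archive_path, sample_size=3):
    """Compare the MD5 of a few random archive members against the files on disk."""
    tar = tarfile.open(archive_path, "r")
    members = [m for m in tar.getmembers() if m.isfile()]
    for member in random.sample(members, min(sample_size, len(members))):
        archived = hashlib.md5(tar.extractfile(member).read()).hexdigest()
        on_disk = hashlib.md5(open("/" + member.name, "rb").read()).hexdigest()
        if archived == on_disk:
            print "OK: %s" % member.name
        else:
            print "MISMATCH: %s" % member.name
    tar.close()

spot_check("temp.tar")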
