Dealing with data, files, and directories is one of the reasons IT organizations need sysadmins. What sysadmin hasn’t needed to process all of the files in a directory tree and parse and replace text? And if you haven’t yet written a script that renames all of the files in a directory tree, you probably will at some point. These abilities are the essence of what it means to be a sysadmin, or at least a really good sysadmin. For the rest of this chapter, we’re going to focus on data, files, and directories.
Sysadmins constantly need to wrangle data from one location to the next. The daily movement of data is more prevalent in some sysadmin jobs than in others. In the animation industry, constantly “wrangling” data from one location to the next is required because digital film production demands terabytes upon terabytes of storage. There are also different disk I/O requirements based on the quality and resolution of the image being viewed at any given time. If data needs to be “wrangled” to an HD preview room so that it can be inspected during a digital daily, then the “fresh” uncompressed, or slightly compressed, HD image files will need to be moved. Files need to be moved because there are generally two types of storage in animation: cheap, large, slow, safe storage, and fast, expensive storage that is often a JBOD, or “just a bunch of disks,” striped together as RAID 0 for speed. A sysadmin in the film industry who primarily deals with data is called a “data wrangler.”
A data wrangler needs to be constantly moving and migrating fresh data from location to location. Often the workhorse of moving data is rsync, scp, cp, or mv. These simple and powerful tools can be scripted with Python to do some incredible things.
Using the standard library, it is possible to do some amazing things without shelling out once. The advantage of using the standard library is that your data moving script will work just about anywhere, without having to depend on a platform-specific version of, say, tar.
Let’s also not forget backups. Many custom backup scripts and applications can be written with a trivial amount of Python code. We caution that writing extra tests for your backup code is not only wise, but necessary: make sure you have both unit and functional tests in place when you are depending on backup scripts you have written yourself.
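To make that concrete, here is a minimal sketch of what a unit test for a hypothetical backup routine might look like. The backup_tree function, the file names, and the MD5 helper are all illustrative assumptions, not code from this chapter:

```python
import hashlib
import os
import shutil
import tempfile
import unittest

def backup_tree(src, dest):
    """Hypothetical backup routine: copy a directory tree to dest."""
    shutil.copytree(src, dest)

def file_md5(path):
    """Return the MD5 hex digest of a file's contents."""
    with open(path, "rb") as fp:
        return hashlib.md5(fp.read()).hexdigest()

class TestBackup(unittest.TestCase):
    def test_backup_preserves_contents(self):
        src = tempfile.mkdtemp()
        dest = os.path.join(tempfile.mkdtemp(), "copy")
        with open(os.path.join(src, "data.txt"), "w") as fp:
            fp.write("important data")
        backup_tree(src, dest)
        # The backed-up file must match the original, byte for byte.
        self.assertEqual(file_md5(os.path.join(src, "data.txt")),
                         file_md5(os.path.join(dest, "data.txt")))
```

A test like this would typically be run with python -m unittest; the point is simply that backup code deserves the same verification discipline as any other production code.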
In addition, it is often necessary to process data at some point before, after, or during a move. Of course, Python is great for this as well. Creating a deduplication tool, that is, a tool that finds duplicate files and performs actions upon them, can be very helpful for this, so we’ll show you how to do it. This is one example of dealing with the constant flow of data that a sysadmin often encounters.
If you have ever struggled with writing cross-platform shell scripts, you will appreciate the fact that the OS module is a portable application programming interface (API) to system services. In Python 2.5, the OS module contains over 200 methods, and many of those methods deal with data. In this section, we will go over many of the methods in that module that systems administrators care about when dealing with data.
Whenever you find yourself needing to explore a new module, IPython is often the right tool for the job, so let’s start our journey through the OS module using IPython to execute a sequence of actions that are fairly commonly encountered. Example 6-1 shows you how to do that.
In [1]: import os

In [2]: os.getcwd()
Out[2]: '/private/tmp'

In [3]: os.mkdir("/tmp/os_mod_explore")

In [4]: os.listdir("/tmp/os_mod_explore")
Out[4]: []

In [5]: os.mkdir("/tmp/os_mod_explore/test_dir1")

In [6]: os.listdir("/tmp/os_mod_explore")
Out[6]: ['test_dir1']

In [7]: os.stat("/tmp/os_mod_explore")
Out[7]: (16877, 6029306L, 234881026L, 3, 501, 0, 102L, 1207014425, 1207014398, 1207014398)

In [8]: os.rename("/tmp/os_mod_explore/test_dir1", "/tmp/os_mod_explore/test_dir1_renamed")

In [9]: os.listdir("/tmp/os_mod_explore")
Out[9]: ['test_dir1_renamed']

In [10]: os.rmdir("/tmp/os_mod_explore/test_dir1_renamed")

In [11]: os.rmdir("/tmp/os_mod_explore/")
As you can see, after we import the OS module, in line [2] we get the current working directory, then proceed to make a directory in line [3]. We then use os.listdir in line [4] to list the contents of our newly created directory. Next, we do an os.stat, which is very similar to the stat command in Bash, and then rename a directory in line [8]. In line [9], we verify that the rename worked, and then we proceed to delete what we created by using the os.rmdir method.
This is by no means an exhaustive look at the OS module. There are methods to do just about anything you would need to do to the data, including changing permissions and creating symbolic links. Please refer to the documentation for the version of Python you are using, or alternately, use IPython with tab completion to view the available methods for the OS module.
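For instance, changing permissions and creating symbolic links look roughly like the following sketch; the file names and the 0755 mode are arbitrary choices for illustration:

```python
import os
import stat
import tempfile

tmpdir = tempfile.mkdtemp()
target = os.path.join(tmpdir, "script.sh")
with open(target, "w") as fp:
    fp.write("#!/bin/sh\necho hello\n")

# Make the file executable, much like chmod 0755.
os.chmod(target, stat.S_IRWXU | stat.S_IRGRP | stat.S_IXGRP |
         stat.S_IROTH | stat.S_IXOTH)

# Create a symbolic link pointing at the file.
link = os.path.join(tmpdir, "script_link")
os.symlink(target, link)

print(os.path.islink(link))                   # True
print(oct(os.stat(target).st_mode & 0o777))   # 0o755
```

Both calls mirror their Unix counterparts (chmod and ln -s), which is a good illustration of how the OS module wraps system services in a portable API.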
Since we talked about data wrangling in the introduction, and you now also have a bit of an idea about how to use the OS module, we can jump right into a higher-level module, called shutil, that deals with data on a larger scale. The shutil module has methods for copying, moving, renaming, and deleting data just as the OS module does, but it can perform actions on an entire data tree.
Exploring the shutil module with IPython is a fun way to get acquainted with it. In the example below, we will be using shutil.copytree, but shutil has many other methods that do slightly different things. Please refer to the Python Standard Library documentation to see the differences between the shutil copy methods. See Example 6-2.
In [1]: import os

In [2]: os.chdir("/tmp")

In [3]: os.makedirs("test/test_subdir1/test_subdir2")

In [4]: ls -lR
total 0
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test/

./test:
total 0
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test_subdir1/

./test/test_subdir1:
total 0
drwxr-xr-x  2 ngift  wheel   68 Mar 31 22:27 test_subdir2/

./test/test_subdir1/test_subdir2:

In [5]: import shutil

In [6]: shutil.copytree("test", "test-copy")

In [19]: ls -lR
total 0
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test/
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test-copy/

./test:
total 0
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test_subdir1/

./test/test_subdir1:
total 0
drwxr-xr-x  2 ngift  wheel   68 Mar 31 22:27 test_subdir2/

./test/test_subdir1/test_subdir2:

./test-copy:
total 0
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test_subdir1/

./test-copy/test_subdir1:
total 0
drwxr-xr-x  2 ngift  wheel   68 Mar 31 22:27 test_subdir2/

./test-copy/test_subdir1/test_subdir2:
Obviously, this is quite simple, yet incredibly useful, as you can quite easily wrap this type of code into a more sophisticated cross-platform data mover script. The immediate use for this kind of code sequence that pops into our heads is to move data from one filesystem to another on an event. In an animation environment, it is often necessary to wait for the latest frames to finish rendering before converting them into a sequence for editing.
We could write a script to watch a directory for “x” number of frames in a cron job. When that cron job sees that the correct number of frames has been reached, it could then migrate that directory into another directory where the frames could be processed, or even just moved so that they are on a faster disk with I/O quick enough to handle playback of uncompressed HD footage.
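A minimal sketch of the check such a cron job might perform could look like the following; the directory names, the .dpx frame extension, and the threshold are all assumptions for illustration:

```python
import os
import shutil

def migrate_when_ready(watch_dir, dest_dir, threshold):
    """If watch_dir holds at least `threshold` frame files, move the
    whole directory to dest_dir.  Returns True when a migration happened.
    (.dpx is a common digital-film frame format; adjust as needed.)"""
    frames = [f for f in os.listdir(watch_dir) if f.endswith(".dpx")]
    if len(frames) >= threshold:
        shutil.move(watch_dir, dest_dir)
        return True
    return False
```

Scheduled from cron every few minutes, a function like this quietly waits until the frame count is reached and then hands the directory off to the fast disk, with no manual monitoring.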
The shutil module doesn’t just copy files, though; it also has methods for moving and deleting trees of data. Example 6-3 shows a move of our tree, and Example 6-4 shows how to delete it.
In [20]: shutil.move("test-copy", "test-copy-moved")

In [21]: ls -lR
total 0
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test/
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test-copy-moved/

./test:
total 0
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test_subdir1/

./test/test_subdir1:
total 0
drwxr-xr-x  2 ngift  wheel   68 Mar 31 22:27 test_subdir2/

./test/test_subdir1/test_subdir2:

./test-copy-moved:
total 0
drwxr-xr-x  3 ngift  wheel  102 Mar 31 22:27 test_subdir1/

./test-copy-moved/test_subdir1:
total 0
drwxr-xr-x  2 ngift  wheel   68 Mar 31 22:27 test_subdir2/

./test-copy-moved/test_subdir1/test_subdir2:
In [22]: shutil.rmtree("test-copy-moved")

In [23]: shutil.rmtree("test-copy")

In [24]: ll
Moving a data tree is a bit more exciting than deleting one, as there is nothing to show after a delete. Many of these simple examples could be combined with other actions in more sophisticated scripts. One script that might be useful is a backup tool that copies a directory tree to cheap network storage and then creates a datestamped archive. Fortunately, we have an example of doing just that in pure Python in the backup section of this chapter.
One can’t talk about dealing with data without taking into account paths, directories, and files. Every sysadmin needs to be able to, at the very least, write a tool that walks a directory, searches for a condition, and then does something with the result. We are going to cover some interesting ways to do just that.
As always, the Standard Library in Python has some killer tools to get the job done. Python doesn’t have a reputation for being “batteries included” for nothing. Example 6-5 shows how to create an extra verbose directory walking script with functions that explicitly return files, directories, and paths.
import os

path = "/tmp"

def enumeratepaths(path=path):
    """Returns the path to all the files in a directory recursively"""
    path_collection = []
    for dirpath, dirnames, filenames in os.walk(path):
        for file in filenames:
            fullpath = os.path.join(dirpath, file)
            path_collection.append(fullpath)
    return path_collection

def enumeratefiles(path=path):
    """Returns all the files in a directory as a list"""
    file_collection = []
    for dirpath, dirnames, filenames in os.walk(path):
        for file in filenames:
            file_collection.append(file)
    return file_collection

def enumeratedir(path=path):
    """Returns all the directories in a directory as a list"""
    dir_collection = []
    for dirpath, dirnames, filenames in os.walk(path):
        for dir in dirnames:
            dir_collection.append(dir)
    return dir_collection

if __name__ == "__main__":
    print "\nRecursive listing of all paths in a dir:"
    for path in enumeratepaths():
        print path
    print "\nRecursive listing of all files in dir:"
    for file in enumeratefiles():
        print file
    print "\nRecursive listing of all dirs in dir:"
    for dir in enumeratedir():
        print dir
On a Mac laptop, the output of this script looks like this:
[ngift@Macintosh-7][H:12022][J:0]# python enumarate_file_dir_path.py

Recursive listing of all paths in a dir:
/tmp/.aksusb
/tmp/ARD_ABJMMRT
/tmp/com.hp.launchport
/tmp/error.txt
/tmp/liten.py
/tmp/LitenDeplicationReport.csv
/tmp/ngift.liten.log
/tmp/hsperfdata_ngift/58920
/tmp/launch-h36okI/Render
/tmp/launch-qy1S9C/Listeners
/tmp/launch-RTJzTw/:0
/tmp/launchd-150.wDvODl/sock

Recursive listing of all files in dir:
.aksusb
ARD_ABJMMRT
com.hp.launchport
error.txt
liten.py
LitenDeplicationReport.csv
ngift.liten.log
58920
Render
Listeners
:0
sock

Recursive listing of all dirs in dir:
.X11-unix
hsperfdata_ngift
launch-h36okI
launch-qy1S9C
launch-RTJzTw
launchd-150.wDvODl
ssh-YcE2t6PfnO
A note about the previous code snippet: os.walk returns a generator object, so if you hold on to the object it returns, you can walk a tree yourself:
In [2]: import os

In [3]: os.walk("/tmp")
Out[3]: <generator object at 0x508e18>
This is what it looks like when it is run from IPython. You will notice that using a generator gives us the ability to call path.next(). We won’t get into the nitty-gritty details about generators here, but it is important to know that os.walk returns a generator object. Generators are tremendously useful for systems programming. Visit David Beazley’s website (http://www.dabeaz.com/generators/) to find out all you need to know about them.
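As a small taste of why generators matter for this kind of work, the following sketch (our own illustration, not one of the book’s examples) lazily yields matching paths instead of building a full list in memory:

```python
import os

def find_files(path, suffix):
    """Lazily yield paths under `path` whose names end with `suffix`.
    Because os.walk is itself a generator, no complete listing of the
    tree is ever built in memory."""
    for dirpath, dirnames, filenames in os.walk(path):
        for name in filenames:
            if name.endswith(suffix):
                yield os.path.join(dirpath, name)
```

On a huge filesystem, next(find_files("/", ".log")) returns the first match without waiting for the whole walk to finish; that laziness is the key property a list-building approach gives up.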
In [2]: import os

In [3]: os.walk("/tmp")
Out[3]: <generator object at 0x508e18>

In [4]: path = os.walk("/tmp")

In [5]: path.
path.__class__         path.__init__          path.__repr__          path.gi_running
path.__delattr__       path.__iter__          path.__setattr__       path.next
path.__doc__           path.__new__           path.__str__           path.send
path.__getattribute__  path.__reduce__        path.close             path.throw
path.__hash__          path.__reduce_ex__     path.gi_frame

In [5]: path.next()
Out[5]: ('/tmp',
 ['.X11-unix',
  'hsperfdata_ngift',
  'launch-h36okI',
  'launch-qy1S9C',
  'launch-RTJzTw',
  'launchd-150.wDvODl',
  'ssh-YcE2t6PfnO'],
 ['.aksusb',
  'ARD_ABJMMRT',
  'com.hp.launchport',
  'error.txt',
  'liten.py',
  'LitenDeplicationReport.csv',
  'ngift.liten.log'])
In a bit, we will look at generators in more detail, but let’s first make a cleaner module that gives us files, directories, and paths in a clean API.
Now that we have walked a very basic directory, let’s turn this into an object-oriented module so that we can easily import and reuse it. Making the hardcoded version generic takes only a small amount of extra work, and a generic module that we can reuse later will certainly make our lives easier. See Example 6-6.
import os

class diskwalk(object):
    """API for getting directory walking collections"""
    def __init__(self, path):
        self.path = path

    def enumeratePaths(self):
        """Returns the path to all the files in a directory as a list"""
        path_collection = []
        for dirpath, dirnames, filenames in os.walk(self.path):
            for file in filenames:
                fullpath = os.path.join(dirpath, file)
                path_collection.append(fullpath)
        return path_collection

    def enumerateFiles(self):
        """Returns all the files in a directory as a list"""
        file_collection = []
        for dirpath, dirnames, filenames in os.walk(self.path):
            for file in filenames:
                file_collection.append(file)
        return file_collection

    def enumerateDir(self):
        """Returns all the directories in a directory as a list"""
        dir_collection = []
        for dirpath, dirnames, filenames in os.walk(self.path):
            for dir in dirnames:
                dir_collection.append(dir)
        return dir_collection
As you can see, with a few small modifications, we were able to make a very nice interface for future modifications. One of the nice things about this new module is that we can import it into another script.
Comparing data is quite important to a sysadmin. Questions you might often ask yourself are, “What files are different between these two directories? How many copies of this same file exist on my system?” In this section, you will find the ways to answer those questions and more.
When dealing with massive quantities of important data, it often is necessary to compare directory trees and files to see what changes have been made. This becomes critical if you start writing large data mover scripts. The absolute doomsday scenario is to write a large data move script that damages critical production data.
In this section, we will first explore a few lightweight methods to compare files and directories, and then move on to doing checksum comparisons of files. The Python Standard Library has several modules that assist with comparisons; we will be covering filecmp and os.listdir.
The filecmp module contains functions for doing fast and efficient comparisons of files and directories. The filecmp module will perform an os.stat on two files and return True if the results of os.stat are the same for both files, or False if they are not. Typically, os.stat is used to determine whether or not two files use the same inodes on a disk and whether they are the same size, but it does not actually compare the contents.
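You can see the difference between the stat-based comparison and a real content comparison directly. In the following sketch (the file names are arbitrary), we force two different files to share a size and modification time; the default shallow compare is fooled, while shallow=False is not:

```python
import filecmp
import os
import tempfile

tmp = tempfile.mkdtemp()
a = os.path.join(tmp, "a.txt")
b = os.path.join(tmp, "b.txt")
with open(a, "w") as fp:
    fp.write("0")
with open(b, "w") as fp:
    fp.write("1")

# Force identical size and modification time, so the stat
# signatures match even though the contents differ.
st = os.stat(a)
os.utime(b, ns=(st.st_atime_ns, st.st_mtime_ns))

print(filecmp.cmp(a, b))                 # True: shallow stat compare
print(filecmp.cmp(a, b, shallow=False))  # False: real content compare
```

This is exactly why stat-based comparison is fast but not authoritative: it never opens the files.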
In order to fully understand how filecmp works, we need to create three files from scratch. To do this on your computer, change into the /tmp directory, make a file called file0.txt, and place a “0” in it. Next, create a file called file1.txt, and place a “1” in it. Finally, create a file called file00.txt, and place a “0” in it. We will use these files as examples in the following code:
In [1]: import filecmp

In [2]: filecmp.cmp("file0.txt", "file1.txt")
Out[2]: False

In [3]: filecmp.cmp("file0.txt", "file00.txt")
Out[3]: True
As you can see, the cmp function returned True in the case of file0.txt and file00.txt, and False when file1.txt was compared with file0.txt.
The dircmp function has a number of attributes that report differences between directory trees. We won’t go over every attribute, but we have created a few examples of useful things you can do. For this example, we created two subdirectories in the /tmp directory and copied the files from our previous example into each directory. In dirB, we created one extra file named file11.txt, into which we put “11”:
In [1]: import filecmp

In [2]: pwd
Out[2]: '/private/tmp'

In [3]: filecmp.dircmp("dirA", "dirB").diff_files
Out[3]: []

In [4]: filecmp.dircmp("dirA", "dirB").same_files
Out[4]: ['file1.txt', 'file00.txt', 'file0.txt']

In [5]: filecmp.dircmp("dirA", "dirB").report()
diff dirA dirB
Only in dirB : ['file11.txt']
Identical files : ['file0.txt', 'file00.txt', 'file1.txt']
You might be a bit surprised to see that there were no matches for diff_files even though we created a file11.txt that has unique information in it. The reason is that diff_files compares only the differences between files with the same name. Next, look at the output of same_files, and notice that it only reports back files that are identical in the two directories. Finally, we can generate a report as shown in the last example; it has a handy output that includes a breakdown of the differences between the two directories. This brief overview is just a bit of what the filecmp module can do, so we recommend taking a look at the Python Standard Library documentation for a full overview of the features we did not have space to cover.
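One such feature worth a quick look is filecmp.cmpfiles, which checks an explicit list of names in two directories and sorts them into matching, mismatching, and erroneous lists. The directory layout below is a made-up illustration:

```python
import filecmp
import os
import tempfile

# Build two small directories to compare (hypothetical layout).
dirA = tempfile.mkdtemp()
dirB = tempfile.mkdtemp()
for d in (dirA, dirB):
    with open(os.path.join(d, "same.txt"), "w") as fp:
        fp.write("identical")
with open(os.path.join(dirA, "diff.txt"), "w") as fp:
    fp.write("aaa")
with open(os.path.join(dirB, "diff.txt"), "w") as fp:
    fp.write("bbb")

# cmpfiles sorts the given names into (matching, mismatching,
# erroneous) lists; names missing from either side land in errors.
match, mismatch, errors = filecmp.cmpfiles(
    dirA, dirB, ["same.txt", "diff.txt", "missing.txt"], shallow=False)
print(match)     # ['same.txt']
print(mismatch)  # ['diff.txt']
print(errors)    # ['missing.txt']
```

Passing shallow=False forces an actual content comparison rather than relying on stat signatures.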
Another lightweight method of comparing directories is to use os.listdir. You can think of os.listdir as an ls command that returns a Python list of the files found. Because Python supports many interesting ways to deal with lists, you can use os.listdir to determine differences in a directory yourself, quite simply by converting your lists into sets and then subtracting one set from another. Here is an example of what this looks like in IPython:
In [1]: import os

In [2]: dirA = set(os.listdir("/tmp/dirA"))

In [3]: dirA
Out[3]: set(['file1.txt', 'file00.txt', 'file0.txt'])

In [4]: dirB = set(os.listdir("/tmp/dirB"))

In [5]: dirB
Out[5]: set(['file1.txt', 'file00.txt', 'file11.txt', 'file0.txt'])

In [6]: dirA - dirB
Out[6]: set([])

In [7]: dirB - dirA
Out[7]: set(['file11.txt'])
From this example, you can see that we used a neat trick: converting two lists into sets and then subtracting the sets to find the differences. Notice that line [7] returns file11.txt because dirB is a superset of dirA, but in line [6] the results are empty because dirA contains all of the same items as dirB. Using sets also makes it easy to create a simple merge of two data structures, by subtracting the full paths of one directory from another and then copying the difference. We will discuss merging data in the next section.
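A sketch of that set-subtraction merge might look like the following; the function name and the decision to handle only top-level files are our own simplifying assumptions:

```python
import os
import shutil

def merge_missing(src, dest):
    """Copy into dest any top-level files present in src but not in
    dest, using set subtraction on os.listdir results.  Returns the
    sorted list of names that were copied."""
    missing = set(os.listdir(src)) - set(os.listdir(dest))
    for name in missing:
        shutil.copy(os.path.join(src, name), os.path.join(dest, name))
    return sorted(missing)
```

Note that this compares names only; two files with the same name but different contents are treated as the same file, which is the limitation discussed next.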
This approach has some very large limitations, though. The actual name of a file is often misleading, as it is possible to have an empty file with the same name as a 200 GB file. In the next section, we cover a better approach to finding the differences between two directories and merging their contents.
What can you do when you don’t want to simply compare data files, but would like to merge two directory trees together? A common problem is merging the contents of one tree into another without creating any duplicates.
You could just blindly copy the files from one directory into your target directory and then deduplicate the directory, but it would be more efficient to prevent the duplicates in the first place. One reasonably simple solution would be to use the filecmp module’s dircmp function to compare two directories, and then copy the unique results using the os.listdir technique described earlier. A better choice would be to use MD5 checksums, which we explain in the next section.
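A sketch of the dircmp-based approach might look like this; the function name is our own, and for simplicity it copies only plain files (subdirectories would need shutil.copytree):

```python
import filecmp
import os
import shutil

def copy_unique(src, dest):
    """Copy files that exist only in src into dest, using dircmp's
    left_only attribute.  Returns the sorted list of copied names."""
    comparison = filecmp.dircmp(src, dest)
    for name in comparison.left_only:
        shutil.copy(os.path.join(src, name), os.path.join(dest, name))
    return sorted(comparison.left_only)
```

Like the os.listdir technique, this decides uniqueness by filename, which is exactly the weakness that checksums address.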
Performing an MD5 checksum on a file and comparing it to another file’s checksum is like going target shooting with a bazooka. It is the big weapon you pull out when you want to be sure of what you are doing, although a byte-by-byte comparison is the only approach that is truly 100 percent accurate. Example 6-7 shows a function that takes in a path to a file and returns a checksum.
import hashlib

def create_checksum(path):
    """
    Reads in file.  Creates checksum of file block by block.
    Returns complete checksum total for file.
    """
    fp = open(path, "rb")    # read as bytes so the hash is reliable
    checksum = hashlib.md5()
    while True:
        buffer = fp.read(8192)
        if not buffer:
            break
        checksum.update(buffer)
    fp.close()
    checksum = checksum.digest()
    return checksum
Here is an iterative example that uses this function with IPython to compare two files:
In [2]: from checksum import create_checksum

In [3]: if create_checksum("image1") == create_checksum("image2"):
   ...:     print "True"
   ...:
True

In [5]: if create_checksum("image1") == create_checksum("image_unique"):
   ...:     print "True"
   ...:
In that example, the checksums of the files were manually compared, but we can use the code we wrote earlier that returns a list of paths to recursively compare a directory tree full of files and gives us duplicates. One of the other nice things about creating a reasonable API is that we can now use IPython to interactively test our solution. Then, if it works, we can create another module. Example 6-8 shows the code for finding the duplicates.
In [1]: from checksum import create_checksum

In [2]: from diskwalk_api import diskwalk

In [3]: from os.path import getsize

In [4]: d = diskwalk('/tmp/duplicates_directory')

In [5]: files = d.enumeratePaths()

In [6]: len(files)
Out[6]: 12

In [7]: dup = []

In [8]: record = {}

In [9]: for file in files:
   ...:     compound_key = (getsize(file), create_checksum(file))
   ...:     if compound_key in record:
   ...:         dup.append(file)
   ...:     else:
   ...:         record[compound_key] = file
   ...:

In [10]: print dup
['/tmp/duplicates_directory/image2']
The only portion of this code that we haven’t looked at in previous examples is the deduplication loop. We create an empty dictionary and then use a compound key, made from each file’s size and checksum, to record the files we have seen. This serves as a simple way to determine whether an identical file has been seen before; if it has, the file is tossed into the dup list. Now, let’s separate this into a piece of code we can use again; after all, it is quite useful. Example 6-9 shows how to do that.
from checksum import create_checksum
from diskwalk_api import diskwalk
from os.path import getsize

def findDupes(path='/tmp'):
    dup = []
    record = {}
    d = diskwalk(path)
    files = d.enumeratePaths()
    for file in files:
        compound_key = (getsize(file), create_checksum(file))
        if compound_key in record:
            dup.append(file)
        else:
            #print "Creating compound key record:", compound_key
            record[compound_key] = file
    return dup

if __name__ == "__main__":
    dupes = findDupes()
    for dup in dupes:
        print "Duplicate: %s" % dup
When we run that script, we get the following output:
[ngift@Macintosh-7][H:10157][J:0]# python find_dupes.py Duplicate: /tmp/duplicates_directory/image2
We hope you can see that this shows what even a little bit of code reuse can accomplish. We now have a generic module that will take a directory tree and return a list of all the duplicate files. This is quite handy in and of itself, but next we can take this one step further and automatically delete the duplicates.
Deleting files in Python is simple: you can use os.remove(file). In this example, we have a number of 10 MB files in our /tmp directory; let’s try to delete one of them using os.remove(file):
In [1]: import os

In [2]: os.remove("10
10mbfile.0  10mbfile.1  10mbfile.2  10mbfile.3  10mbfile.4
10mbfile.5  10mbfile.6  10mbfile.7  10mbfile.8

In [2]: os.remove("10mbfile.1")

In [3]: os.remove("10
10mbfile.0  10mbfile.2  10mbfile.3  10mbfile.4
10mbfile.5  10mbfile.6  10mbfile.7  10mbfile.8
Notice that tab completion in IPython allows us to see the matches and fills out the names of the files for us. Be aware that os.remove(file) is silent and permanent, so this might or might not be what you want. With that in mind, we can implement an easy method to delete our duplicates, and then enhance it after the fact. Because it is so easy to test interactive code with IPython, we are going to write a test function on the fly and try it:
In [1]: from find_dupes import findDupes

In [2]: dupes = findDupes("/tmp")

In [3]: def delete(file):
   ...:     import os
   ...:     print "deleting %s" % file
   ...:     os.remove(file)
   ...:

In [4]: for dupe in dupes:
   ...:     delete(dupe)
   ...:
deleting /tmp/10mbfile.2
deleting /tmp/10mbfile.3
deleting /tmp/10mbfile.4
deleting /tmp/10mbfile.5
deleting /tmp/10mbfile.6
deleting /tmp/10mbfile.7
deleting /tmp/10mbfile.8
In this example, we added some complexity to our delete method by including a print statement for each file we automatically delete. Just because we created a whole slew of reusable code doesn’t mean we need to stop now. We can create another module that does fancy delete-related things with file objects. The module doesn’t even need to be tied to duplicates; it can be used to delete anything. See Example 6-10.
#!/usr/bin/env python
import os

class Delete(object):
    """Delete Methods For File Objects"""
    def __init__(self, file):
        self.file = file

    def interactive(self):
        """interactive deletion mode"""
        input = raw_input("Do you really want to delete %s [N]/Y " % self.file)
        if input.upper() == "Y":
            print "DELETING: %s" % self.file
            status = os.remove(self.file)
        else:
            print "Skipping: %s" % self.file
        return

    def dryrun(self):
        """simulation mode for deletion"""
        print "Dry Run: %s [NOT DELETED]" % self.file
        return

    def delete(self):
        """Performs a delete on a file, with additional conditions"""
        print "DELETING: %s" % self.file
        status = None
        try:
            status = os.remove(self.file)
        except Exception, err:
            print err
        return status

if __name__ == "__main__":
    from find_dupes import findDupes
    dupes = findDupes('/tmp')
    for dupe in dupes:
        delete = Delete(dupe)
        #delete.dryrun()
        #delete.delete()
        #delete.interactive()
In this module, you will see three different types of delete. The interactive delete method prompts the user to confirm each file before it is deleted. This can seem a bit annoying, but it is good protection when other programmers will be maintaining and updating the code.
The dry run method simulates a deletion. And, finally, there is an actual delete method that will permanently delete your files. At the bottom of the module, you can see that there is a commented example of the ways to use each of these three different methods. Here is an example of each method in action:
Dry run
[ngift@Macintosh-7][H:10197][J:0]# python delete.py
Dry Run: /tmp/10mbfile.1 [NOT DELETED]
Dry Run: /tmp/10mbfile.2 [NOT DELETED]
Dry Run: /tmp/10mbfile.3 [NOT DELETED]
Dry Run: /tmp/10mbfile.4 [NOT DELETED]
Dry Run: /tmp/10mbfile.5 [NOT DELETED]
Dry Run: /tmp/10mbfile.6 [NOT DELETED]
Dry Run: /tmp/10mbfile.7 [NOT DELETED]
Dry Run: /tmp/10mbfile.8 [NOT DELETED]
Interactive
[ngift@Macintosh-7][H:10201][J:0]# python delete.py
Do you really want to delete /tmp/10mbfile.1 [N]/Y Y
DELETING: /tmp/10mbfile.1
Do you really want to delete /tmp/10mbfile.2 [N]/Y
Skipping: /tmp/10mbfile.2
Do you really want to delete /tmp/10mbfile.3 [N]/Y
Delete
[ngift@Macintosh-7][H:10203][J:0]# python delete.py
DELETING: /tmp/10mbfile.1
DELETING: /tmp/10mbfile.2
DELETING: /tmp/10mbfile.3
DELETING: /tmp/10mbfile.4
DELETING: /tmp/10mbfile.5
DELETING: /tmp/10mbfile.6
DELETING: /tmp/10mbfile.7
DELETING: /tmp/10mbfile.8
You might find using encapsulation techniques like this very handy when dealing with data because you can prevent a future problem by abstracting what you are working on enough to make it nonspecific to your problem. In this situation, we wanted to automatically delete duplicate files, so we created a module that generically finds filenames and deletes them. We could make another tool that generically takes file objects and applies some form of compression as well. We are actually going to get to that example in just a bit.
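For example, a hypothetical Compress class in the same spirit as Delete might look like the following sketch; the gzip choice and the method names are our own assumptions:

```python
import gzip
import os
import shutil

class Compress(object):
    """Generic per-file compression, in the spirit of the Delete class."""
    def __init__(self, path):
        self.path = path

    def dryrun(self):
        """simulation mode for compression"""
        print("Dry Run: %s [NOT COMPRESSED]" % self.path)

    def compress(self):
        """gzip the file, leaving path.gz, and remove the original."""
        gz_path = self.path + ".gz"
        with open(self.path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(self.path)
        return gz_path
```

Because the class knows nothing about duplicates, the same object could be driven by findDupes, by a pattern match, or by any other file-selection logic.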
So far you have seen how to process directories and files, performing actions such as finding duplicates, deleting directories, and moving directories. The next step in mastering the directory tree is to use pattern matching, either alone or in combination with these previous techniques. As with just about everything else in Python, performing a pattern match for a file extension or filename is simple. In this section, we will demonstrate a few common pattern-matching problems and apply the techniques used earlier to create simple, yet powerful reusable tools.
A fairly common problem sysadmins need to solve is tracking down and deleting, moving, renaming, or copying a certain file type. The most straightforward approach to doing this in Python is to use either the fnmatch module or the glob module. The main difference between these two modules is that fnmatch returns True or False for a match against a Unix wildcard, while glob returns a list of pathnames that match a pattern. Alternatively, regular expressions can be used to create more sophisticated pattern-matching tools. Please refer to Chapter 3 for more detailed instructions on using regular expressions to match patterns.
Example 6-11 will look at how fnmatch and glob can be used. We will reuse the code we’ve been working on by importing diskwalk from the diskwalk_api module.
In [1]: from diskwalk_api import diskwalk

In [2]: files = diskwalk("/tmp").enumeratePaths()

In [3]: from fnmatch import fnmatch

In [4]: for file in files:
   ...:     if fnmatch(file, "*.txt"):
   ...:         print file
   ...:
/tmp/file.txt

In [5]: from glob import glob

In [6]: import os

In [7]: os.chdir("/tmp")

In [8]: glob("*")
Out[8]: ['file.txt', 'image.iso', 'music.mp3']
In the previous example, after we reused our diskwalk module, we received a list of all of the full paths located in the /tmp directory. We then used fnmatch to determine whether each file matched the pattern “*.txt”. The glob module is a bit different, in that it will literally “glob,” or match a pattern, and return the full path. glob is a much higher-level function than fnmatch, but both are very useful tools for slightly different jobs.
The fnmatch function is particularly useful when it is combined with other code to create a filter to search for data in a directory tree. Often, when dealing with directory trees, you will want to work with files that match certain patterns. To see this in action, we will solve a classic sysadmin problem by renaming all of the files that match a pattern in a directory tree. Keep in mind that it is just as simple to rename files as it is to delete, compress, or process them. There is a simple pattern here:

1. Get the path to a file in a directory.
2. Perform some optional layer of filtering; this could involve many filters, such as filename, extension, size, uniqueness, and so on.
3. Perform an action on them; this could be copying, deleting, compressing, reading, and so on.

Example 6-12 shows how to do this.
In [1]: from diskwalk_api import diskwalk

In [2]: from shutil import move

In [3]: from fnmatch import fnmatch

In [4]: files = diskwalk("/tmp").enumeratePaths()

In [5]: for file in files:
   ...:     if fnmatch(file, "*.mp3"):
   ...:         #here we can do anything we want: delete, move, rename...hmmm, rename
   ...:         move(file, "%s.txt" % file)
   ...:

In [6]: ls -l /tmp/
total 0
-rw-r--r--  1 ngift  wheel  0 Apr  1 21:50 file.txt
-rw-r--r--  1 ngift  wheel  0 Apr  1 21:50 image.iso
-rw-r--r--  1 ngift  wheel  0 Apr  1 21:50 music.mp3.txt
-rw-r--r--  1 ngift  wheel  0 Apr  1 22:45 music1.mp3.txt
-rw-r--r--  1 ngift  wheel  0 Apr  1 22:45 music2.mp3.txt
-rw-r--r--  1 ngift  wheel  0 Apr  1 22:45 music3.mp3.txt
Using code we already wrote, we needed only a few lines of very readable Python to rename a tree full of MP3 files to text files. If you are one of the few sysadmins who has not read at least one episode of BOFH, or Bastard Operator From Hell, it might not be immediately obvious what we could do next with our bit of code.
Imagine you have a production file server that is strictly for high-performance disk I/O storage, and it has a limited capacity. You have noticed that it often gets full because one or two abusers place hundreds of GBs of MP3 files on it. You could put a quota on the amount of file space each user can access, of course, but often that is more trouble than it is worth. One solution would be to create a cron job every night that finds these MP3 files, and does “random” things to them. On Monday it could rename them to text files, on Tuesday it could compress them into ZIP files, on Wednesday it could move them all into the /tmp directory, and on Thursday it could delete them, and send the owner of the file an emailed list of all the MP3 files it deleted. We would not suggest doing this unless you own the company you work for, but for the right BOFH, the earlier code example is a dream come true.
As you might well already know, rsync is a command-line tool that was originally written by Andrew Tridgell and Paul Mackerras. Late in 2007, rsync version 3 was released for testing, and it includes an even greater assortment of options than the original version.
Over the years, we have found ourselves using rsync as the primary tool to move data from point A to point B. The manpage and its options are a staggering work, so we recommend that you read through them in detail. Rsync may just be the single most useful command-line tool ever written for systems administrators.
With that being said, there are some ways that Python can help control, or glue rsync’s behavior. One problem that we have encountered is ensuring that data gets copied at a scheduled time. We have been in many situations in which we needed to synchronize TBs of data as quickly as possible between one file server and another, but we did not want to monitor the process manually. This is a situation in which Python really makes a lot of sense.
With Python you can add a degree of artificial intelligence to rsync and customize it to your particular needs. The point of using Python as glue code is that you make Unix utilities do things they were never intended to do, and so you make highly flexible and customizable tools. The limit is truly only your imagination. Example 6-13 shows a very simple example of how to wrap rsync.
#!/usr/bin/env python
#wraps up rsync to synchronize two directories
from subprocess import call
import sys

source = "/tmp/sync_dir_A/" #Note the trailing slash
target = "/tmp/sync_dir_B"
rsync = "rsync"
arguments = "-a"
cmd = "%s %s %s %s" % (rsync, arguments, source, target)

def sync():
    ret = call(cmd, shell=True)
    if ret != 0:
        print "rsync failed"
        sys.exit(1)

sync()
This example is hardcoded to synchronize two directories and to print out a failure message if the command does not work. We could do something a bit more interesting, though, and solve a problem that we have frequently run into. We have often found that we are called upon to synchronize two very large directories, and we don’t want to monitor the data synchronization overnight. But if you don’t monitor the synchronization, you can find that it was disrupted partway through the process; quite often the data, along with a whole night, is wasted, and the process needs to start again the next day. Using Python, you can create a more aggressive, highly motivated rsync command.
What would a highly motivated rsync command do exactly? Well, it would do what you would do if you were monitoring the synchronization of two directories: it would continue trying to synchronize the directories until it finished, and then it would send an email saying it was done. Example 6-14 shows the rsync code of our little overachiever in action.
#!/usr/bin/env python
#wraps up rsync to synchronize two directories
from subprocess import call
import sys
import time

"""this motivated rsync tries to synchronize forever"""

source = "/tmp/sync_dir_A/" #Note the trailing slash
target = "/tmp/sync_dir_B"
rsync = "rsync"
arguments = "-av"
cmd = "%s %s %s %s" % (rsync, arguments, source, target)

def sync():
    while True:
        ret = call(cmd, shell=True)
        if ret != 0:
            print "resubmitting rsync"
            time.sleep(30)
        else:
            print "rsync was successful"
            call("mail -s 'jobs done' [email protected]", shell=True)
            sys.exit(0)

sync()
This is overly simplified and contains hardcoded data, but it is an example of the kind of useful tool you can develop to automate something you normally need to monitor manually. There are some other features you can include, such as the ability to set the retry interval and limit as well as the ability to check for disk usage on the machine to which you are connecting and so on.
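As a sketch of that retry-interval and retry-limit idea, here is one way a wrapper might look. The rsync command line shown in the comment is a placeholder, and the default values are arbitrary; adjust both to your environment.

```python
import subprocess
import time

def sync(cmd, retry_interval=30, max_retries=10):
    """Re-run a shell command (such as an rsync invocation) until it
    succeeds or the retry limit is reached.

    Returns True on success, False if every attempt failed.
    """
    for attempt in range(max_retries):
        # A zero exit status means the command completed successfully.
        if subprocess.call(cmd, shell=True) == 0:
            return True
        print("attempt %d failed, sleeping %s seconds" % (attempt + 1, retry_interval))
        time.sleep(retry_interval)
    return False

# Example invocation (the paths are placeholders; note the trailing slash
# on the source, which tells rsync to copy the directory's contents):
#   sync("rsync -a /tmp/sync_dir_A/ /tmp/sync_dir_B")
```

Bounding the retries keeps a flaky network mount from turning the script into an infinite loop, while still giving transient failures a chance to clear.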
Most systems administrators get to the point where they start to be concerned, not just about data, but about the data about the data. Metadata, or data about data, can often be more important than the data itself. To give an example, in film and television, the same data often exists in multiple locations on a filesystem or even on several filesystems. Keeping track of the data often involves creating some type of metadata management system.
It is the data about how those files are organized and used, though, that can be the most critical to an application, to an animation pipeline, or to restore a backup. Python can help here, too, as it is easy to both use metadata and write metadata with Python.
Let’s look at using a popular ORM, SQLAlchemy, to create some metadata about a filesystem. Fortunately, the documentation for SQLAlchemy is very good, and SQLAlchemy works with SQLite. We think this is a killer combination for creating custom metadata solutions.
In the examples above, we walked a filesystem in real time and performed actions and queries on paths that we found. While this is incredibly useful, it is also time-consuming to search a large filesystem consisting of millions of files to do just one thing. In Example 6-15, we show what a very basic metadata system could look like by combining the directory walking techniques we have just mastered with an ORM.
#!/usr/bin/env python
from sqlalchemy import create_engine
from sqlalchemy import Table, Column, Integer, String, MetaData, ForeignKey
from sqlalchemy.orm import mapper, sessionmaker
import os

#path
path = "/tmp"

#Part 1: create engine
engine = create_engine('sqlite:///:memory:', echo=False)

#Part 2: metadata
metadata = MetaData()
filesystem_table = Table('filesystem', metadata,
    Column('id', Integer, primary_key=True),
    Column('path', String(500)),
    Column('file', String(255)),
)
metadata.create_all(engine)

#Part 3: mapped class
class Filesystem(object):
    def __init__(self, path, file):
        self.path = path
        self.file = file
    def __repr__(self):
        return "[Filesystem('%s','%s')]" % (self.path, self.file)

#Part 4: mapper function
mapper(Filesystem, filesystem_table)

#Part 5: create session
Session = sessionmaker(bind=engine, autoflush=True, transactional=True)
session = Session()

#Part 6: crawl file system and populate database with results
for dirpath, dirnames, filenames in os.walk(path):
    for file in filenames:
        fullpath = os.path.join(dirpath, file)
        record = Filesystem(fullpath, file)
        session.save(record)

#Part 7: commit to the database
session.commit()

#Part 8: query
for record in session.query(Filesystem):
    print "Database Record Number: %s, Path: %s, File: %s" % \
        (record.id, record.path, record.file)
It is best to think about this code as a set of procedures that are followed one after another. In part one, we create an engine, which is really just a fancy way of defining the database we are going to use. In part two, we define a metadata instance and create our database tables. In part three, we create a class that will map to the tables in the database that we created. In part four, we call a mapper function that puts the ORM into effect; it actually maps this class to the tables. In part five, we create a session to our database. Notice that there are a few keyword parameters that we set, including autoflush and transactional.
Now that we have the very explicit ORM setup completed, in part six, we do our usual song and dance, and grab the filenames and complete paths while we walk a directory tree. There are a couple of twists this time, though. Notice that we create a record in the database for each fullpath and file we encounter, and that we then save each newly created record as it is created. We then commit this transaction to our “in memory” SQLite database in part seven.
Finally, in part eight, we perform a query, in Python, of course, that returns the results of the records we placed in the database. This example could potentially be a fun way to experiment with creating custom SQLAlchemy metadata solutions for your company or clients. You could expand this code to do something interesting, such as perform relational queries or write results out to a file, and so on.
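As a rough sketch of where you could take this, the same crawl-and-query idea works with nothing but the standard library’s sqlite3 module, which avoids any dependence on a particular SQLAlchemy release. The table and column names below simply mirror the example above, and the sample files are invented for the demonstration.

```python
import os
import sqlite3
import tempfile

# Build the same filesystem table, this time with the stdlib sqlite3 module.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filesystem (id INTEGER PRIMARY KEY, path TEXT, file TEXT)")

# Crawl a small throwaway tree instead of /tmp so the result is predictable.
top = tempfile.mkdtemp()
open(os.path.join(top, "a.txt"), "w").close()
open(os.path.join(top, "b.mp3"), "w").close()

for dirpath, dirnames, filenames in os.walk(top):
    for name in filenames:
        conn.execute("INSERT INTO filesystem (path, file) VALUES (?, ?)",
                     (os.path.join(dirpath, name), name))
conn.commit()

# A relational-style query against the metadata: only the .mp3 records.
rows = conn.execute("SELECT file FROM filesystem WHERE file LIKE '%.mp3'").fetchall()
print(rows)   # [('b.mp3',)]
```

Once the walk results live in a database, questions like “which MP3 files are on the fast storage?” become one-line queries instead of another full filesystem crawl.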
Dealing with data in big chunks is a problem that sysadmins have to face every day. They often use tar, dd, gzip, bzip, bzip2, hdiutil, asr, and other utilities to get their jobs done.
Believe it or not, the “batteries included” Python Standard Library has built-in support for TAR files, zlib files, and gzip files. If compression and archiving are your goal, then you will not have any problem with the rich tools Python has to offer. Let’s look at the granddaddy of all archive formats, tar, and see how the standard library implements it.
Creating a TAR archive is quite easy, almost too easy in fact. In Example 6-16, we create a very large file as an example. Note that the syntax is much more user friendly than even the tar command itself.
In [1]: f = open("largeFile.txt", "w")

In [2]: statement = "This is a big line that I intend to write over and over again."

In [3]: x = 0

In [4]: for x in xrange(20000):
   ...:     x += 1
   ...:     f.write("%s " % statement)

In [5]: ls -l
-rw-r--r--  1 root  root  1236992 Oct 25 23:13 largeFile.txt
OK, now that we have a big file full of junk, let’s TAR that baby up. See Example 6-17.
In [1]: import tarfile

In [2]: tar = tarfile.open("largefile.tar", "w")

In [3]: tar.add("largeFile.txt")

In [4]: tar.close()

In [5]: ll
-rw-r--r--  1 root  root  1236992 Oct 25 23:15 largeFile.txt
-rw-r--r--  1 root  root  1236992 Oct 26 00:39 largefile.tar
So, as you can see, this makes a vanilla TAR archive with a much easier syntax than the regular tar command. This certainly makes the case for using the IPython shell to do all of your daily systems administration work.
While it is handy to be able to create a TAR file using Python, it is almost useless to TAR up only one file. Using the same directory walking pattern we have used numerous times in this chapter, we can create a TAR file of the whole /tmp directory by walking the tree and then adding each file to the contents of the /tmp directory TAR. See Example 6-18.
In [27]: import tarfile

In [28]: tar = tarfile.open("temp.tar", "w")

In [29]: import os

In [30]: for root, dir, files in os.walk("/tmp"):
   ....:     for file in filenames:
   ....:
KeyboardInterrupt

In [30]: for root, dir, files in os.walk("/tmp"):
   ....:     for file in files:
   ....:         fullpath = os.path.join(root, file)
   ....:         tar.add(fullpath)

In [33]: tar.close()
It is quite simple to add the contents of a directory tree by walking a directory, and it is a good pattern to use, because it can be combined with some of the other techniques we have covered in this chapter. Perhaps you are archiving a directory full of media files. It seems silly to archive exact duplicates, so perhaps you want to replace duplicates with symbolic links before you create a TAR file. With the information in this chapter, you can easily build the code that will do just that and save quite a bit of space.
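A minimal sketch of that duplicate-replacement idea might look like the following. It hashes every regular file with MD5 and swaps later copies for symbolic links to the first copy seen; it reads each file whole, which is fine for a sketch but worth chunking for media-sized files.

```python
import hashlib
import os
import tempfile

def dedupe_with_symlinks(top):
    """Replace exact duplicate files under ``top`` with symlinks to the
    first copy seen. Returns the number of files replaced."""
    seen = {}        # MD5 digest -> path of the first file with that content
    replaced = 0
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            fullpath = os.path.join(dirpath, name)
            if os.path.islink(fullpath):
                continue          # already deduped on an earlier pass
            with open(fullpath, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            if digest in seen:
                os.remove(fullpath)
                os.symlink(seen[digest], fullpath)
                replaced += 1
            else:
                seen[digest] = fullpath
    return replaced

# Demonstrate on a throwaway directory holding two identical files.
top = tempfile.mkdtemp()
for name in ("copy1.dat", "copy2.dat"):
    with open(os.path.join(top, name), "w") as f:
        f.write("same bytes")
print(dedupe_with_symlinks(top))   # 1
```

Run this before handing the tree to tarfile and the archive stores one copy of the data plus cheap link entries instead of full duplicates.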
Since doing a generic TAR archive is a little bit boring, let’s spice it up a bit and add bzip2 compression, which will make your CPU whine and complain at how much you are making it work. The bzip2 compression algorithm can do some really funky stuff. Let’s look at an example of how impressive it can truly be.
Then we’ll get really funky and make a 61 MB text file shrink down to 10 K! See Example 6-19.
In [1]: tar = tarfile.open("largefilecompressed.tar.bzip2", "w|bz2")

In [2]: tar.add("largeFile.txt")

In [3]: ls -h
foo1.txt  fooDir1/  largeFile.txt  largefilecompressed.tar.bzip2*
foo2.txt  fooDir2/  largefile.tar

In [4]: tar.close()

In [5]: ls -lh
-rw-r--r--  1 root  root   61M Oct 25 23:15 largeFile.txt
-rw-r--r--  1 root  root   61M Oct 26 00:39 largefile.tar
-rwxr-xr-x  1 root  root   10K Oct 26 01:02 largefilecompressed.tar.bzip2*
What is amazing is that bzip2 was able to compress our 61 M text file into 10 K, although we did cheat quite a bit using the same data over and over again. This didn’t come at zero cost of course, as it took a few minutes to compress this file on a dual core AMD system.
Let’s go the whole nine yards and do a compressed archive with the rest of the available options, starting with gzip next. The syntax is only slightly different. See Example 6-20.
In [10]: tar = tarfile.open("largefile.tar.gzip", "w|gz")

In [11]: tar.add("largeFile.txt")

In [12]: tar.close()

In [13]: ls -lh
-rw-r--r--  1 root  root   61M Oct 26 01:20 largeFile.txt
-rw-r--r--  1 root  root   61M Oct 26 00:39 largefile.tar
-rwxr-xr-x  1 root  root  160K Oct 26 01:24 largefile.tar.gzip*
A gzip archive is still incredibly small, coming in at 160 K, but on my machine it was able to create this compressed TAR file in seconds. This seems like a good trade-off in most situations.
Now that we have a tool that creates TAR files, it only makes sense to examine the TAR file’s contents. It is one thing to blindly create a TAR file, but if you have been a systems administrator for any length of time, you have probably gotten burned by a bad backup, or have been accused of making a bad backup.
To put this situation in perspective and highlight the importance of examining TAR archives, we will share a story about a fictional friend of ours; let’s call it The Case of the Missing TAR Archive. Names, identities, and facts are fictional; if this story resembles reality, it is completely coincidental.
Our friend worked at a major television studio as a systems administrator and was responsible for supporting a department led by a real crazy man. This man had a reputation for not telling the truth, acting impulsively, and, well, being crazy. If a situation arose in which the crazy man was at fault, such as missing a deadline with a client or not producing a segment according to the specifications he was given, he would gladly just lie and blame it on someone else. Oftentimes, that someone else was our friend, the systems administrator.
Unfortunately, our friend was responsible for maintaining this lunatic’s backups. His first thought was that it was time to look for a new job, but he had worked at this studio for many years, had many friends there, and didn’t want to waste all that on a temporarily bad situation. He needed to make sure he covered all of his bases, so he instituted a logging system that cataloged the contents of all of the automated TAR archives created for the crazy man, as he felt it was only a matter of time before he would get burned when the crazy man missed a deadline and needed an excuse.
One day our friend, William, got a call from his boss: “William, I need to see you in my office immediately; we have a situation with the backups.” William immediately walked over to his boss’s office and was told that the crazy man, Alex, had accused William of damaging the archive of his show, which had caused him to miss a deadline with his client. When Alex missed deadlines with his client, it made Alex’s boss, Bob, very upset.
William was told by his boss that Alex claimed the backup contained nothing but empty, damaged files, and that he had been depending on that archive to work on his show. William then told his boss that he had been certain he would eventually be accused of messing up an archive, and had quietly written some Python code that inspected the contents of every TAR archive he made and recorded extended information about the attributes of the files before and after they were backed up. It turned out that Alex had never created a show to begin with, and that an empty folder had been archived for months.
When Alex was confronted with this information, he quickly backpedaled and looked for some way to shift attention onto a new issue. Unfortunately for Alex, this was the last straw, and a couple of months later he stopped showing up to work. He may have left or been fired, but it didn’t matter: our friend William had solved The Case of the Missing TAR Archive.
The moral of this story is that when you are dealing with backups, treat them like nuclear weapons, as backups are fraught with danger in ways you might not even imagine.
Here are several methods to examine the contents of the TAR file we created earlier:
In [1]: import tarfile

In [2]: tar = tarfile.open("temp.tar", "r")

In [3]: tar.list()
-rw-r--r-- ngift/wheel          2 2008-04-04 15:17:14 tmp/file00.txt
-rw-r--r-- ngift/wheel          2 2008-04-04 15:15:39 tmp/file1.txt
-rw-r--r-- ngift/wheel          0 2008-04-04 20:50:57 tmp/temp.tar
-rw-r--r-- ngift/wheel          2 2008-04-04 16:19:07 tmp/dirA/file0.txt
-rw-r--r-- ngift/wheel          2 2008-04-04 16:19:07 tmp/dirA/file00.txt
-rw-r--r-- ngift/wheel          2 2008-04-04 16:19:07 tmp/dirA/file1.txt
-rw-r--r-- ngift/wheel          2 2008-04-04 16:19:52 tmp/dirB/file0.txt
-rw-r--r-- ngift/wheel          2 2008-04-04 16:19:52 tmp/dirB/file00.txt
-rw-r--r-- ngift/wheel          2 2008-04-04 16:19:52 tmp/dirB/file1.txt
-rw-r--r-- ngift/wheel          3 2008-04-04 16:21:50 tmp/dirB/file11.txt

In [4]: tar.name
Out[4]: '/private/tmp/temp.tar'

In [5]: tar.getnames()
Out[5]:
['tmp/file00.txt',
 'tmp/file1.txt',
 'tmp/temp.tar',
 'tmp/dirA/file0.txt',
 'tmp/dirA/file00.txt',
 'tmp/dirA/file1.txt',
 'tmp/dirB/file0.txt',
 'tmp/dirB/file00.txt',
 'tmp/dirB/file1.txt',
 'tmp/dirB/file11.txt']

In [10]: tar.members
Out[10]:
[<TarInfo 'tmp/file00.txt' at 0x109eff0>,
 <TarInfo 'tmp/file1.txt' at 0x109ef30>,
 <TarInfo 'tmp/temp.tar' at 0x10a4310>,
 <TarInfo 'tmp/dirA/file0.txt' at 0x10a4350>,
 <TarInfo 'tmp/dirA/file00.txt' at 0x10a43b0>,
 <TarInfo 'tmp/dirA/file1.txt' at 0x10a4410>,
 <TarInfo 'tmp/dirB/file0.txt' at 0x10a4470>,
 <TarInfo 'tmp/dirB/file00.txt' at 0x10a44d0>,
 <TarInfo 'tmp/dirB/file1.txt' at 0x10a4530>,
 <TarInfo 'tmp/dirB/file11.txt' at 0x10a4590>]
These examples show how to examine the names of the files in the TAR archive, which could be validated in a data verification script. Extracting files is not much more work. If you want to extract a whole TAR archive to the current working directory, you can simply use the following:
In [60]: tar.extractall()

drwxrwxrwx  7 ngift  wheel  238 Apr  4 22:59 tmp/
If you are extremely paranoid, and you should be, you could also include a step that extracts the contents of the archive and performs random MD5 checksums on files from it, comparing them against MD5 checksums you made on the files before they were backed up. This can be a very effective way to monitor whether the integrity of the data is what you expect it to be.
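Here is one sketch of such a spot check using only the standard library. The filenames and archive path are invented for the demonstration, and a real script would persist the pre-backup checksums somewhere safer than a dictionary in memory.

```python
import hashlib
import os
import random
import tarfile
import tempfile

def md5_file(path):
    """Hash a file in chunks so large archives don't exhaust memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Build a tiny archive, recording checksums before backing up.
workdir = tempfile.mkdtemp()
checksums = {}
for name in ("a.txt", "b.txt", "c.txt"):
    path = os.path.join(workdir, name)
    with open(path, "w") as f:
        f.write("data for %s" % name)
    checksums[name] = md5_file(path)

archive = os.path.join(workdir, "backup.tar")
tar = tarfile.open(archive, "w")
for name in checksums:
    tar.add(os.path.join(workdir, name), arcname=name)
tar.close()

# Spot check: extract one randomly chosen member and compare digests.
tar = tarfile.open(archive, "r")
victim = random.choice(tar.getnames())
extract_dir = tempfile.mkdtemp()
tar.extract(victim, path=extract_dir)
tar.close()
ok = md5_file(os.path.join(extract_dir, victim)) == checksums[victim]
print("spot check passed" if ok else "ARCHIVE CORRUPT")
```

Run from cron against last night’s archives, a check like this turns “I think the backup worked” into evidence you can show a boss, or an Alex.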
No sane archiving solution should just trust that an archive was created properly. At the very least, random spot checking of archives needs to be done automatically. At best, every single archive should be reopened and checked for validity after it has been created.