Chapter 3. Interacting with the world

 

This chapter covers

  • What libraries are
  • How to use libraries, including Python’s standard library
  • An example program that uses Python’s os and sys libraries
  • Python’s dictionary data type

 

One of the key strengths of Python is its standard library. Installed along with Python, the standard library is a large suite of program code that covers common tasks like finding and iterating over files, handling user input, downloading and parsing pages from the web, and accessing databases. If you make good use of the standard library, you can often write programs in a fraction of the time that it would take you otherwise, with less typing and far fewer bugs.

 

Guido’s Time Machine

The standard library is so extensive that one of the running jokes in the Python community is that Guido (the inventor of Python) owns a time machine. When someone asks for a module that performs a particular task, Guido hops in his time machine, travels back to the beginning of Python, and— “poof!”—it’s already there.

 

In chapter 2, you used the choice function in Python’s random module to pick something from a list, so you’ve already used a library. In this chapter, we’ll go in depth and find out more about how to use libraries, what other libraries exist, and how to use Python’s documentation to learn about specific libraries. In the process, you’ll also pick up a few other missing pieces of Python, such as how you can read files, and you’ll discover another of Python’s data types—the dictionary.

The program in this chapter solves a common problem that you’ve probably faced before: you have two similar folders (perhaps one’s a backup of your holiday photos), and you’d like to know which files differ between the two of them. You’ll be tackling this program from a different angle than in chapter 2, though. Rather than write most of your own code, you’ll be using Python to glue together several standard libraries to get the job done.

Let’s start by learning more about Python libraries.

“Batteries included”: Python’s libraries

What are libraries used for? Normally, they’re geared toward a single purpose, such as sending data via a network, writing CSV or Excel files, displaying graphics, or handling user input. But libraries can grow to cover a large number of related functions; there’s no hard or fast rule.

 

Library

Program code that is written so that it can be used by other programs.

 

Python libraries can do anything that Python can, and more. In some (rare) cases, like intensive number crunching or graphics processing, Python can be too slow to do what you need; but it’s possible to extend Python to use libraries written in C.

In this section, you’ll learn about Python’s standard library, see which other libraries you can add, try them out, and get a handle on exploring a single library.

Python’s standard library

Python installs with a large number of libraries that cover most of the common tasks that you’ll need to handle when programming.

If you find yourself facing a tricky problem, it’s a good habit to read through the modules in Python’s standard library to see if something covers what you need to do. The Python manuals are installed with the standard Windows installer, and there’s normally a documentation package when installing under Linux. The latest versions are also available at http://docs.python.org if you’re connected to the internet. Being able to use a good library can save you hours of programming, so 5 or 10 minutes up front can pay big dividends.

The Python standard library is large enough that it can be hard to find what you need. Another way to learn it is to take it one piece at a time. The Python Module of the Week blog (www.doughellmann.com/PyMOTW/) covers most of Python’s standard library and is an excellent way to familiarize yourself with what’s available, because it often contains far more explanation than the standard Python documentation.

Other libraries

You’re not limited to the libraries that Python installs. It’s easy to download and install extra libraries to add the additional functionality that you need. Most add-on libraries come with their own installers or installation script; those that don’t can normally be copied into the library folder of your Python directory. You’ll find out how to install libraries in later chapters, Once the extra libraries are installed, they behave like Python’s built-in ones; there’s no special syntax that you need to know.

Using libraries

Once installed, using a library is straightforward: just add an import line at the top of the script. There are several ways to do it, but here are the three most common.

Include Everything

You can include everything from a library into your script by using a line like

from os import *

This will read everything from the os module and drop it straight into your script. If you want to use the access function from os, you can use it directly, like access("myfile.txt"). This has the advantage of saving some typing, but with serious downsides:

  • You now have a lot of strange functions in your script.
  • Worse, if you include more than one module in this way, then you run the risk of functions in the later module overwriting the functions from the first module—ouch!
  • Finally, it’s much harder to remember which module a particular function came from, which makes your program difficult to maintain.

Fortunately, there are much better ways to import modules.

Include the Module

A better way to handle things is with a line like import os. This will import everything in os but make it available only through an os object. Now, if you want to use the access function, you need to use it like this: os.access("myfile.txt"). It’s a bit more typing, but you won’t run the risk of overwriting any other functions.

Include Only the Bits That You Want

If you’re using the functions from a module a lot, you might find that your code becomes hard to read, particularly if the module has a long name. There’s a third option in this case: you can use a line like from os import access. This will import directly so that you can use access ("myfile.txt") without the module name, but only include the access function, not the entire os module. You still run the risk of overwriting with a later module, but, because you have to specify the functions and there are fewer of them, it’s much less likely.

What’s in a library, anyway?

Libraries can include anything that comes with standard Python— variables, functions, and classes, as well as Python code that should be run when the library is loaded. You’re not limited in any way; anything that’s legal in Python is fine to put in a library. When using a library for the first time, it helps to know what’s in it, and what it does. There are two main ways to find out.

 

Tip

dir and help aren’t only useful for libraries. You can try them on all of the Python objects, such as classes and functions. They even support strings and numbers.

 

Read the Fine Manual

Python comes with a detailed manual on every aspect of its use, syntax, standard libraries—pretty much everything you might need to reference when writing programs. It doesn’t cover every possible use, but the majority of the standard library is there. If you have internet access, you can view it at http://docs.python.org, and it’s normally installed alongside Python, too.

Exploration

One useful function for finding out what a library contains is dir(). You can call it on any object to find out what methods it supports, but it’s particularly useful with libraries. You can combine it with the __doc__ special variable, which is set to the doc-string defined for a function or method, to get a quick overview of a library’s or class’s methods and what they do. This combination is so useful that there’s a shortcut called help() that is defined as one of Python’s built-in functions.

For the details, you’re often better off looking at the documentation; but if you only need to jog your memory, or if the documentation is patchy or confusing, dir(), __doc__, and help() are much faster. The following listing is an example of looking up some information about the os library.

Listing 3.1. Finding out more about the os.path library

First, you need to import the os module . You can import os.path directly, but this is the way that it’s normally done, so you’ll have fewer surprises later. Next, you call the dir() function on os.path, to see what’s in it . The function will return a big list of function and variable names, including some built-in Python ones like __doc__ and __name__.

Because you can see a __doc__ variable in os.path, print it and see what it contains . It’s a general description of the os.path module and how it’s supposed to be used.

If you look at the __doc__ variable for a function in os.path , it shows much the same thing—a short description of what the function is supposed to do.

Once you’ve found a function that you think does what you need, you can try it out to make sure . Here, you’re calling os.path.isdir() on a couple of different files and directories to see what it returns. For more complicated libraries, you might find it easier to write a short program rather than type it all in at the command line.

Finally, the output of the help()function contains all the same information that __doc__ and dir() do, but printed nicely. It also looks through the whole object and returns all of its variables and methods without you having to look for them. You can press space or page up and down to read the output, and Q when you want to go back to the interpreter.

In practice, it can often take a combination of these methods before you understand enough about the library for it to be useful. A quick overview of the library documentation, followed by some experimenting at the command line and a further read of the documentation, will provide you with some of the finer points once you understand how it all fits together. Also, bear in mind that you don’t necessarily have to understand the entire library at once, as long as you can pick and choose the pieces you need.

Now that you know the basics of Python libraries, let’s see what you can do with them.

Another way to ask questions

There’s one thing that you need to know before you can start putting your program together. Actually, there are a couple of other things, but you can pick those up on the way. What you’d like to be able to do in order to begin is tell the computer which directories you want to compare. If this were a normal program, you’d probably have a graphical interface where you could click the relevant directories. But that sounds hard, so you’ll pick something simpler to write: a command-line interface.

Using command-line arguments

Command-line arguments are often used in system-level programs. When you run a program from the command line, you can specify additional parameters by typing them after the program’s name. In this case, you’ll be typing in the names of the two directories that you want to compare; something like this:

python difference.py directory1 directory2

If you have spaces in your directory name, you can surround the parameters with quotation marks; otherwise, your operating system will interpret it as two different parameters:

python difference.py "My Documentsdirectory1" "My Documentsdirectory2"

Now that you have your parameters, what are you going to do with them?

Using the sys module

In order to read the parameters you’ve fed in, you’ll need to use the sys module that comes with Python’s standard library. sys deals with all sorts of system-related functionality, such as finding out which version of Python a script is running on, information about the script, paths, and so on. You’ll be using sys.argv, which is an array containing the script’s name and any parameters that it was called with. Your initial program is listing 3.2, which will be the starting point for the comparison script.

Listing 3.2. Reading parameters using sys

First, you check to make sure that the script has been called with enough parameters . If there are too few, then you return an error to the user. Note also that you’re using sys.argv[0] to find out what the name of your script is and sys.exit to end the program early.

Because you know now that there are at least two other values, you can store them for later use . You could use sys.argv directly, but this way, you’ve got a nice variable name, which makes the program easier to understand.

Once you have the variables set, you can print them out to make sure they’re what you’re expecting. You can test it out by trying the commands from the section “Using command-line arguments.” The script should respond back with whatever you’ve specified.

 

Note

File objects are an important part of Python. Quite a few libraries use file-like objects to access other things, like web pages, strings, and the output returned from other programs.

 

If you’re happy with the results, it’s time to start building the program in the next section.

Reading and writing files

The next thing you’ll need to do in your duplicate checker is to find your files and directories and open them to see if they’re the same. Python has built-in support for handling files as well as good cross platform file and directory support via the os module. You’ll be using both of these in your program.

Paths and directories (a.k.a. dude, where’s my file?)

Before you open your file, you need to know where to find it. You want to find all of the files in a directory and open them, as well as any files in directories within that directory, and so on. That’s pretty tricky if you’re writing it yourself; fortunately, the os module has a function called os.walk() that does exactly what you want. The os.walk() function returns a list of all of the directories and files for a path. If you append listing 3.3 to the end of listing 3.2, it will call os.walk() on the directories that you’ve specified.

Listing 3.3. Using os.walk()

You’re going to be doing the same thing for both directory1 and directory2 . You could repeat your code over again for directory2, but if you want to change it later, you’ll have to change it in two places. Worse, you could accidentally change one but not the other, or change it slightly differently. A better way is to use the directory names in a for loop like this, so you can reuse the code within the loop.

It’s good idea to check the input that your script’s been given . If there’s something amiss, then exit with a reasonable error message to let the user know what’s gone wrong.

is the part where you walk over the directory. For now, you’re printing the raw output that’s returned from os.walk(), but in a minute you’ll do something with it.

I’ve set up two test directories on my computer with a few directories that I found lying around. It’s probably a good idea for you to do the same, so you can test your program and know you’re making progress.

If you run the program so far, you should see something like the following output:

D:code>python difference_engine_2_os.py . test1 test2
Comparing:
test1
test2

Directory test1
('C:\test1', ['31123', 'My Music', 'My Pictures', 'test'], [])
('C:\test1\31123', [], [])
('C:\test1\My Music', [], ['Desktop.ini', 'Sample Music.lnk'])
('C:\test1\My Pictures', [], ['Sample Pictures.lnk'])
('C:\test1\test', [], ['foo1.py', 'foo1.pyc', 'foo2.py', 'foo2.pyc', 
     'os.walk.py', 'test.py'])

Directory test2
('C:\test2', ['31123', 'My Music', 'My Pictures', 'test'], [])
('C:\test2\31123', [], [])
('C:\test2\My Music', [], ['Desktop.ini', 'Sample Music.lnk'])
('C:\test2\My Pictures', [], ['Sample Pictures.lnk'])
('C:\test2\test', [], ['foo1.py', 'foo1.pyc', 'foo2.py', 'foo2.pyc', 
'os.walk.py', 'test.py'])

In Python strings, some special characters can be created by using a backslash in front of another character. If you want a tab character, for example, you can put into your string. When Python prints it, it will be replaced with a literal tab character. If you do want a backslash, though—as you do here—then you’ll need to use two backslashes, one after the other.

The output for each line gives you the name of a directory within your path, then a list of directories within that directory, then a list of the files ... handy, and definitely beats writing your own version.

Paths

If you want to use a file or directory, you’ll need what’s called a path. A path is a string that gives the exact location of a file, including any directories that contain it. For example, the path to Python on my computer is C:python26python.exe, which looks like "C:\python26\python.exe" when expressed as a Python string.

If you wanted a path for foo2.py in the last line of the previous listing, you can use os.path.join('C:\test2\test', 'foo2.py'), to get a path that looks like 'C:\test2\test\foo2.py'. You’ll see more of the details when you start putting your program together in a minute.

 

Tip

One thing to keep in mind when using paths is that the separator will be different depending on which platform you’re using. Windows uses a backslash () character, and Linux and Macintosh use a forward slash (/). To make sure your programs work on all three systems, it’s a good idea to get in the habit of using the os.path.join() function, which takes a list of strings and joins them with whatever the path separator is on the current computer.

 

Once you have the location of your file, the next step is opening it.

File, open!

To open a file in Python, you can use the file() or open() built-in function. They’re exactly the same behind the scenes, so it doesn’t matter which one you use. If the file exists and you can open it, you’ll get back a file object, which you can read using the read() or readlines() method. The only difference between read() and readlines() is that readlines() will split the file into strings, but read() will return the file as one big string. This code shows how you can open a file and read its contents:

read_file = file(os.path.join("c:\test1\test", "foo2.py")) 
file_contents = list(read_file.readlines())                  
print "Read in", len(file_contents), "lines from foo2.py"    
print "The first line reads:", file_contents[0]              

First, create a path using os.path.join(), and then use it to open the file at that location. You’ll want to put in the path to a text file that exists on your computer. read_file will now be a file object, so you can use the readlines() method to read the entire contents of the file. You’re also turning the file contents into a list using the list() function. You don’t normally treat files like this, but it helps to show you what’s going on. file_contents is a list now, so you can use the len() function to see how many lines it has, and print the first line by using an index of 0.

Although you won’t be using it in your program, it’s also possible to write text into a file as well as read from it. To do this, you’ll need to open the file with a write mode instead of the default read-only mode, and use the write() or writelines() function of the file object. Here’s a quick example:

You’re using the same file() function you used before, but here you’re feeding it an extra parameter, the string "w", to tell Python that you want to open it for writing .

Once you have the file object back, you can write to it by using the .write() method, with the string you want to write as a parameter . The " " at the end is a special character for a new line; without it, all of the output would be on one line. You can also write multiple lines at once, by putting them into a list and using the .writelines() method instead .

Once you’re done with a file, it’s normally a good idea to close it , particularly if you’re writing to it. Files can sometimes be buffered, which means they’re not written onto the disk straight away—if your computer crashes, it might not be saved.

That’s not all you can do with files, but it’s enough to get started. For your difference engine you won’t need to write files, but it will help for future programs. For now, let’s turn our attention to the last major feature you’ll add to your program.

Comparing files

We’re almost there, but there’s one last hurdle. When you’re running your program, you need to know whether you’ve seen a particular file in the other directory, and if so, whether it has the same content, too. You could read in all the files in and compare their content line by line, but what if you have a large directory with big images? That’s a lot of storage, which means Python is likely to run slowly.

 

Note

It’s often important to consider how fast your program will run, or how much data it will need to store, particularly if the problem that you’re working on is open ended—that is, if it might be run on a large amount of data.

 

Fingerprinting a file

Fortunately, there’s another library to help you, called hashlib, which is used to generate a hash for a particular piece of data. A hash is like a fingerprint for a file: from the data it’s given, it will generate a list of numbers and letters that’s virtually guaranteed to be unique for that data. If even a small part of the file changes, the hash will be completely different, and you’ll be able to detect the change. Best of all, the hashes are relatively small, so they won’t take up much space. The following listing features a small script that shows how you might generate a hash for one file.

Listing 3.4. Generating a hash for a file

After importing your libraries, you read a file name from the command line and open it . Next, you create a hash object here , which will handle all of the hash generation. I’m using md5, but there are many others in hashlib.

Once you have an open file and a hash object, you feed each line of the file into the hash with the update() method .

After you’ve fed all the lines into the hash, you can get the final hash in hexdigest form . It uses only numbers and the letters af, so it’s easy to display on screen or paste into an email.

An easy way to test the script is to run it on itself. After you’ve run it once, try making a minor change to the script, such as adding an extra blank line at the end of the file. If you run the script again, the output should be completely different.

Here, I’m running the hash-generating script on itself. For the same content, it will always generate the same output:

D:	est>python hash.py hash.py             
df16fd6453cedecdea3dddca83d070d4          
D:	est>python hash.py hash.py            
df16fd6453cedecdea3dddca83d070d4  

These are the results of adding one blank line to the end of the hash.py file. It’s a minor change (most people wouldn’t notice it), but now the hash is completely different:

D:	est>hash.py hash.py
47eeac6e2f3e676933e88f096e457911    

Now that your hashes are working, let’s see how you can use them in your program.

Mugshots: storing your files’ fingerprints in a dictionary

Now that you can generate a hash for any given file, you need somewhere to put it. One option is to put the hashes into a list, but searching over a list every time you want to find a particular file is slow, particularly if you have a large directory with lots of files. There’s a better way to do it, by using Python’s other main data type: the dictionary.

You can think of dictionaries as a bag of data. You put data in, give it a name, and then, later, when you want the data back, you give the dictionary its name, and the dictionary will return the data. In Python’s terminology, the name is called a key and the data is called the value for that key. Let’s see how you use a dictionary by taking a look at the following listing.

Listing 3.5. How to use a dictionary

Dictionaries are fairly similar to lists, except that you use curly braces instead of square brackets, and you separate keys and their values with a colon.

The other similarity to lists is that you can include anything that you like as a value , including lists, dictionaries, and other objects. You’re not limited to storing simple types like strings or numbers, or one type of thing. The only constraint is on the key: it can only be something that isn’t modifiable, like a string or number.

To get your value back once you’ve put it in the dictionary, use the dictionary’s name with the key after it in square brackets . If you’re finished with a value, it’s easy to remove it by using del followed by the dictionary and the key that you want to delete .

Dictionaries are objects, so they have some useful methods as well as direct access. keys() returns all of the keys in a dictionary, values() will return its values, and items() returns both the keys and values. Typically, you’ll use it in a for loop, like this:

for key, value in test_dictionary.items(): ...

When deciding what keys and values to use for a dictionary, the best option is to use something unique for the key, and the data you’ll need in your program as the value. You might need to convert the data somehow when building your dictionary, but it normally makes your code easier to write and easier to understand. For your dictionary, you’ll use the path to the file as the key, and the checksum you’ve generated as the value.

Now that you know about hashes and dictionaries, let’s put your program together.

Putting it all together

“Measure twice, cut once” is an old adage that often holds true. When programming, you always have your undo key, but you can’t undo the time you spent writing the code you end up throwing away.

When developing a program, it often helps to have some sort of plan in place as to how you’ll proceed. Your plan doesn’t have to be terribly detailed; but it can help you to avoid potential roadblocks or trouble spots if you can foresee them. Now that you think you have all of the parts you’ll need, let’s plan out the overall design of your program at a high level. It should go something like

  • Read in and sanity-check the directories you want to compare.
  • Build a dictionary containing all the files in the first directory.
  • For each file in the second directory, compare it to the same file in the first dictionary.

That seems pretty straightforward. In addition to having this overall structure, it can help to think about the four different possibilities for each file, as shown in the following figure.

Figure 3.1. The four possibilities for differences between files
Case 1 The file doesn’t exist in directory 2. Case 2 The file exists, but is different in each directory.
Case 3 The files are identical in both. Case 4 The file exists in directory 2, but not in your first directory.

Given this rough approach, a couple of issues should stand out. First, your initial plan of building all the checksums right away may not be such a good idea after all. If the file isn’t in the second directory, then you’ll have gone to all the trouble of building a checksum that you’ll never use. For small files and directories it might not make much difference, but for larger ones (for example, photos from a digital camera or MP3s), the extra time might be significant. The solution to this is to put a placeholder into the dictionary that you build and only generate the checksum once you know you have both files.

 

Can’t You Use A List?

If you’re putting a placeholder into your dictionary instead of a checksum, you’d normally start by using a list. Looking up a value in a dictionary is typically much faster, though; for large lists, Python needs to check each value in turn, whereas a dictionary needs a single lookup. Another good reason is that dictionaries are more flexible and easier to use than lists if you’re comparing independent objects.

 

Second, what happens if a file is in the first directory but not the second? Given the rough plan we just discussed, you’re only comparing the second directory to the first one, not vice versa. You won’t notice a file if it’s not in the second directory. One solution to this is to delete the files from the dictionary as you compare them. Once you’ve finished the comparisons, you know that anything left over is missing from the second directory.

Planning like this can take time, but it’s often faster to spend a little time up front working out potential problems. What’s better to throw away when you change your mind: five minutes of design or half an hour of writing code? Listings 3.6 and 3.7 show the last two parts of your program based on the updated plan. You can join these together with listings 3.2 and 3.3 to get a working program.

Listing 3.6. Utility functions for your difference program

This is the program from listing 3.5, rolled up into a function. Notice how a docstring has been added as the second line so it’s easy to remember what the function does.

Because you’ll be building a list of files for two directories, it makes sense to have a function that returns all the information you need about a directory , so you can reuse it each time. The two things you need are the root, or lowest-level directory (the one typed in at the command line) and a list of all the files relative to that root so you can compare the two directories easily. For example, C: est est_dirfile.txt and C: est2 est_dirfile.txt should both be entered into their respective dictionaries as est_dirfile.txt.

Because os.walk() starts at the root of a directory by default, all you need to do is remember the first directory that it returns . You do that by setting dir_root to None before you enter the for loop. None is a special value in Python that means “not set” or “value unknown.” It’s what you use if you need to define a variable but don’t know its value yet. Inside the loop, if dir_root is None, you know it’s the first time through the loop and you have to set it. You’re setting a dir_trim variable too, so that later you can easily trim the first part of each directory that’s returned.

Once you have your directory root, you can chop off the common part of your directories and path separators from the front of the path returned by os.walk() . You do that by using string slices, which will return a subsection of a string. It works in exactly the same way as a list index, so it starts at 0 and can go up to the length of the string.

When you’re done, you return both the directory listing and the root of the directory using a special Python data type called a tuple. Tuples are similar to lists, except that they’re immutable—you can’t change them after they’ve been created.

Now that you’ve checked your inputs and set up all of your program’s data, you can start making use of it. As in chapter 2, when you simplified Hunt the Wumpus, the part of the program that does stuff is fairly short, clear, and easy to understand. All the tricky details have been hidden away inside functions, as you can see in the next listing.

Listing 3.7. Finding the differences between directories

To assign both of the variables you get back from your function, you separate them with a comma . You’ve already seen this when using dictionary.items() in a for loop.

Here’s the first comparison : if the file isn’t in directory 1, then you warn the user. You can use in with a dictionary in the same way that you would for a list, and Python will return True if the object is in the dictionaries’ keys.

If the file exists in both directories, then you build a checksum for each file and compare them . If they’re different, then you know the files are different and you again warn the user. If the checksums are the same then you keep quiet, because you don’t want to overwhelm people with screens and screens of output—they want to know the differences.

Once you’ve compared the files in section 3, you delete them from the dictionary. Any that are left over you know aren’t in directory 2 and you tell the user about them .

That seems to about do it for your program, but are you sure it’s working? Time to test it.

Testing your program

If you haven’t already, now’s probably a good time to create some test directories so you can try your script and make sure it’s working. It’s especially important as you start working on problems that have real-world consequences. For example, if you’re backing up some family photos and your program doesn’t report that a file has changed (or doesn’t exist), you won’t know to back it up and might lose it if your hard drive crashes. Or it might report two files as the same when they’re actually different.

You can test your script on directories that you already have, but specific test directories are a good idea, mainly because you can exercise all the features you’re expecting. At a minimum, I’d suggest

  • Adding at least two directory levels, to make sure paths are handled properly
  • Creating a directory with at least one space in its name
  • Using both text and binary files (for example, images)
  • Setting up all the cases you’re expecting (files missing, file differences, files that are the same)

By thinking about all the possible cases, you can catch bugs in your program before you run it over a real directory and miss something or, worse, lose important data. The following figure shows the initial test directory (called test) that I set up on my computer.

Figure 3.2. A test directory for the difference engine

This test directory doesn’t get all the possible failures, but it does check for most of them. The next step was to copy that directory (I called it test2) and make some changes for the difference engine to work on, as shown in figure 3.3. I’ve used the numbers 1 to 4 within the files to represent each of the possible cases, with 1 and 4 being missing files, 2 for files that have some differences, and 3 for files that are identical in both directories.

Figure 3.3. test2, an almost identical copy of the first test directory

You can see the output of running your script over these directories:

D:>python codedifference_engine.py test test2
Comparing:
test
test2
dir test root is test
dir test2 root is test2
test	est 2	est2.txt and test2	est 2	est2.txt differ!
image4.gif not found in directory 1
test 2	est4.txt not found in directory 1
testimage2.gif and test2image2.gif differ!
test4.txt not found in directory 1
test	est2.txt and test2	est2.txt differ!
test1.txt not found in directory 2
test 2	est1.txt not found in directory 2
image1.gif not found in directory 2

That seems to be pretty much what you were expecting. The script is descending into the test 2 directory in each case and is picking up the differences between the files—1 and 4 are missing, 2 is different, and 3 isn’t reported because the files are identical.

Now that you’ve tested out your script, let’s see what you can do to improve it.

Improving your script

Your script so far works, but it could do with a few improvements. For a start, the results it returns are out of order. The files that are missing from the second directory appear right at the end. Ideally, you’d have them appear next to the other entries for that directory, to make it easier to see what the differences are.

 

Note

Does this strategy look familiar? It’s exactly what you did when developing Hunt the Wumpus. You start by writing a program that’s as simple as you can make it and then build on the extra features that you need.

 

Putting results in order

It initially might be difficult to see how you might go about ordering the results, but if you think back to chapter 2, one of the strategies that you used with Hunt the Wumpus was to separate the program from its interface. In your difference engine, you haven’t done so much of that so far—now might be a good time to start. You need two parts to your program: one part that does the work and stores the data it generates, and another to display that data. The following listing shows how you generate your results and store them.

Listing 3.8. Separating generated results from display

Here’s the trick. Rather than try to display the results as soon as you get them, which means you’re trying to shoehorn your program structure into your display structure, you store the results in a dictionary to display later .

The result of each comparison is stored in result , with the file path as the key and a description of the result of the comparison as the value.

That should take care of storing the results; let’s take a look at how you display them:

sorted() is a built-in Python function that sorts groups of items . You can give it lists, dictionary keys, values or items, strings, and all sorts of other things. In this case, you’re using it to sort result.items() by file_path, the first part of result.items().

Within the body of the loop, you’re using in to check the contents of the strings . You want to know whether this path is part of a directory, in which case it will have os.path.sep somewhere within it, and you also want to know whether the result shows that the files are the same.

Now that you’ve displayed everything within the root of the directory, you can go ahead and show everything within the subdirectories . You’re reversing the sense of the if statement to show everything that wasn’t shown the first time around.

In hindsight, that was relatively easy. Following the pattern you established in Hunt the Wumpus, separating data from its display is a powerful tactic that can make complicated problems easy to understand and program.

Comparing directories

The other thing your program should probably handle is the case where you have empty directories. Currently it only looks for files, and any empty directories will be skipped. Although unnecessary for your initial use case (checking for missing images before you back up), it will almost certainly be useful somewhere down the track. Once you’ve added this feature, you’ll be able to spot any change in the directories, short of permission changes to the files—and it requires surprisingly little code. The next listing shows how I did it.

Listing 3.9. Comparing directories, too

The first thing to do is to include directory paths as well as files when generating a listing . To do that, you join the dirs and files lists with the + operator.

If you try to open a directory to read its contents, you’ll get an error ; this is because directories don’t have contents the same way files do. To get around that, it’s ok to cheat a little bit. You alter the md5 function and use os.path.isdir() to find out whether it’s a directory. If it is, you return a dummy value of '1'. It doesn’t matter what the contents of a directory are, because the files will be checked in turn, and you only care whether a directory exists (or not).

Once you’ve made those changes, you’re done. Because the directories follow the same data structure as the files, you don’t need to make any changes to the comparison or display parts of your program. You’ll probably want to add some directories to both your test directories to make sure the program is working properly.

You’ve improved your script, but that doesn’t mean there isn’t more you can do.

Where to from here?

The program as it stands now is feature-complete based on your initial need, but you can use the code you’ve written so far for other purposes. Here are some ideas:

  • If you’re sure you won’t have any different files, you can extend the program to create a merged directory from multiple sources. Given a number of directories, consolidate their contents into a third, separate location.
  • A related task would be to find all the identical copies of a file in a directory—you might have several old backups and want to know whether there are any sneaky extra files you’ve put in one of them.
  • You could create a change monitor—a script that notifies you of changes in one directory. One script would look at a directory and store the results in a file. The second script would look at that file and directory and tell you if any of the output has changed. Your storage file doesn’t have to be complicated—a text file containing a path and checksum for each file should be all you need.
  • You can also use your os.walk functions as a template to do something other than check file contents. A script to check directory sizes could be useful. Your operating system will probably give you information about how much space a particular directory takes up, but what if you want to graph usage over time, or break your results down by file type? A script is much more flexible, and you can make it do whatever you need.

You’ll need to avoid the temptation of reinventing the wheel. If a tool has already been written that solves your problem, it’s generally better to use that, or at least include it in your script if possible. For example, you might consider writing a program that shows you the changes between different versions of files as well as whether they’re different—but that program’s already been written; it’s called diff. It’s widely available as a command-line program under Linux, but it’s also available for Windows and comes in graphical versions, too.

One of the other programming tricks is knowing when to stop. Gold-plating your program can be fun, but you could always be working on your next project instead!

Summary

In this chapter, you learned about some of the standard library packages available with every installation of Python, as well as how to include and use them and how to learn about unfamiliar ones. You built what would normally be a fairly complex application, but, because you made good use of several Python libraries, the amount of code you had to write was minimal.

In the next chapter, we’ll look at another way of organizing programs, as well as other uses for functions and some other Python techniques that can help you to write clearer, more concise code. The program in this chapter was fairly easy to test, but not all programs will be that straightforward, so we’ll also look at another way of testing programs to make sure they work.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.140.185.147