Chapter 11. Text Processing

There is a whole range of applications for which scripting languages like Python are perfectly suited; in fact, scripting languages were arguably invented specifically for these applications, which involve the simple search and processing of various files in the directory tree. Taken together, these applications are often called text processing. Python is a great tool for writing quick text processing scripts and then scaling them up into more generally useful code later, thanks to its clean object-oriented coding style.

In this chapter you learn:

  • Some of the typical reasons you need text processing scripts

  • A few simple scripts for quick system administration tasks

  • How to navigate around in the directory structure in a platform-independent way, so your scripts will work fine on Linux, Windows, or even the Mac

  • How to create regular expressions to compare the files found by the os and os.path modules

  • How to use successive refinement to keep enhancing your Python scripts to winnow through the data found

Text processing scripts are one of the most useful tools in the toolbox of anybody who seriously works with computer systems, and Python is a great way to do text processing. You're going to like this chapter.

Why Text Processing Is So Useful

In general, the whole idea behind text processing is simply finding things. There are, of course, situations in which data are organized in a structured way; these are called databases and that's not what this chapter is about. Databases carefully index and store data in such a way that if you know what you're looking for, you can retrieve it quickly. However, in some data sources, the information is not at all orderly and neat, such as directory structures with hundreds or thousands of files, or logs of events from system processes consisting of thousands or hundreds of thousands of lines, or even e-mail archives with months of exchanges between people.

When data of that nature needs to be searched for something, or processed in some way, text processing is in its element. Of course, there's no reason not to combine text processing with other data-access methods; you might find yourself writing scripts rather often that run through thousands of lines of log output and do occasional RDBMS lookups (Relational DataBase Management Systems — you learn about these in Chapter 14) on some of the data they run across. This is a natural way to work.

Ultimately, this kind of script can very often get used for years as part of a back-end data processing system. If the script is written in a language like Perl, it can sometimes be quite opaque when some poor soul is assigned five years later to "fix it." Fortunately, this is a book about Python programming, and so the scripts written here can easily be turned into reusable object classes — later, you look at an illustrative example.

The two main tools in your text processing belt are directory navigation, and an arcane technology called regular expressions. Directory navigation is one area in which different operating systems can really wreak havoc on simple programs, because the three major operating system families (UNIX, Windows, and the Mac) all organize their directories differently; and, most painfully, they use different characters to separate subdirectory names. Python is ready for this, though — a series of cross-platform tools are available for the manipulation of directories and paths that, when used consistently, can eliminate this hassle entirely. You saw these in Chapter 8, and you see more uses of these tools here.

A regular expression is a way of specifying a very simple text parser, which then can be applied relatively inexpensively (which means that it will be fast) to any number of lines of text. Regular expressions crop up in a lot of places, and you've likely seen them before. If this is your first exposure to them, however, you'll be pretty pleased with what they can do. In the scope of this chapter, you're just going to scratch the surface of full-scale regular expression power, but even this will give your scripts a lot of functionality.

You first look at some of the reasons you might want to write text processing scripts, and then you do some experimentation with your new knowledge. The most common reasons to use regular expressions include the following:

  • Searching for files

  • Extracting useful data from program logs, such as a web server log

  • Searching through your e-mail

The following sections introduce these uses.

Searching for Files

Searching for files, or doing something with some files, is a mainstay of text processing. For example, suppose that you spent a few months ripping your entire CD collection to MP3 files, without really paying attention to how you were organizing the hundreds of files you were tossing into some arbitrarily made-up set of directories. That wasn't a problem until a couple of months later, when you finally thought about organizing your files into directories by artist and realized that the directory structure you had ended up with was hopelessly confused.

Text processing to the rescue! Write a Python script that scans the hopelessly nonsensical directory structure, divides each file name into parts that might be an artist's name, and then tries to look each potential name up in a music database. The result is that you could rearrange hundreds of files into directories by, if not the name of the artist, certainly some pretty good guesses, which would get you close to having a sensible structure. From there, you would be able to explore manually and end up with an actually organized music library.
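A sketch of such a script might look like the following. The "Artist - Title.mp3" naming convention is an assumption made for illustration, and the music-database lookup is left out:

```python
import os

def guess_artist(filename):
    """Guess an artist from a name like 'Artist - Title.mp3' (an assumed convention)."""
    base = os.path.splitext(filename)[0]
    return base.split("-")[0].strip()

def collect_guesses(music_root):
    """Walk music_root and group MP3 paths by the guessed artist name."""
    guesses = {}
    for dirpath, dirnames, filenames in os.walk(music_root):
        for name in filenames:
            if not name.lower().endswith(".mp3"):
                continue
            artist = guess_artist(name)
            guesses.setdefault(artist, []).append(os.path.join(dirpath, name))
    return guesses
```

Each guessed artist name could then be checked against a music database before any files are actually moved.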

This is a one-time use of a text processing script, but you can easily imagine other scenarios in which you might use a similarly useful script on a regular basis, such as when you are handling data from a client or from a data source that you don't control. Of course, if you need to do this kind of sorting often, you can easily use Python to come up with some organized tool classes that perform these tasks to avoid having to duplicate your effort each time.

Whenever you face a task like this, a task that requires a lot of manual work manipulating data on your computer, think Python. Writing a script or two could save you hours and hours of tedious work.

A second, similar situation is a fallout of today's large hard disks. Many users store files willy-nilly on their hard disk but never seem to have the time to organize them. A worse situation occurs when you face a hard disk full of files and need to extract some information you know is there on your computer, but you're not sure exactly where. You are not alone. Apple, Google, Microsoft, and others all offer desktop search tools that help you sift through the data in the files you have collected and extract useful information.

Think of Python as a desktop search on steroids, because you can create scripts with a much finer control over the search, as well as perform operations on the files found.
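As a tiny illustration of that finer control, here is a sketch that greps a single file for a word, case-insensitively, and reports line numbers; a fuller script would combine it with os.walk to cover a whole directory tree:

```python
def search_file(path, needle):
    """Return (line_number, line) pairs for lines in path containing needle."""
    hits = []
    # errors="replace" keeps the scan going even over oddly encoded files
    with open(path, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            if needle.lower() in line.lower():
                hits.append((lineno, line.rstrip("\n")))
    return hits
```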

Clipping Logs

Another common text-processing task that comes up in system administration is the need to sift through log files for various information. Scripts that filter logs can be spur-of-the-moment affairs meant to answer specific questions (such as "When did that e-mail get sent?" or "When was the last time my program logged one specific message?"), or they might be permanent parts of a data processing system that evolves over time to manage ongoing tasks. These could be a part of a system administration and performance-monitoring system, for instance. Scripts that regularly filter logs for particular subsets of the information are often said to be clipping logs — the idea being that, just as you clip polygons to fit on the screen, you can also clip logs to fit into whatever view of the system you need.

However you decide to use them, after you gain some basic familiarity with the techniques used, these scripts become almost second nature. This is an application where regular expressions are used a lot, for two reasons: First, it's very common to use a UNIX shell command like grep to do first-level log clipping; second, if you do it in Python, you'll probably be using regular expressions to split the line into usable fields before doing more work with it. In any one clipping task, you may very well be using both techniques.
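To give a taste of the second technique, here is a sketch that uses a regular expression to split a log line into usable fields. The syslog-style line format assumed here is purely illustrative:

```python
import re

# Assumed line shape: "Jan 12 03:04:05 host process: message"
LOG_PATTERN = re.compile(r"(\w{3} +\d+ [\d:]+) (\S+) (\S+?): (.*)")

def clip_line(line):
    """Split one log line into (timestamp, host, process, message), or None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return match.groups()
```

In a real clipping script, a loop would feed every line of the log file through clip_line and keep only the tuples of interest.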

After a short introduction to traversing the file system and creating regular expressions, you look at a couple of scripts for text processing in the following sections.

Sifting through Mail

The final text processing task is one that you've probably found useful (or if you haven't, you've badly wanted it): the processing of mailbox files to find something that can't be found by your normal Inbox search feature. The most common reason you need something more powerful for this is that the mailbox file is either archived, so that you can access the file, but not read it with your mail reader easily, or it has been saved on a server where you've got no working mail client installed. Rather than go through the hassle of moving it into your Inbox tree and treating it like an active folder, you might find it simpler just to write a script to scan it for whatever you need.

However, you can also easily imagine a situation in which your search script might want to get data from an outside source, such as a web page or perhaps some other data source, like a database (see Chapter 14 for more about databases), to cross-reference your data, or do some other task during the search that can't be done with a plain vanilla mail client. In that case, text processing combined with any other technique can be an incredibly useful way to find information that may not be easy to find any other way.
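Python's standard library even ships a mailbox module for exactly this job. A sketch for scanning an mbox-format archive (the file path would be whatever your mail system uses) might look like this:

```python
import mailbox

def find_messages(mbox_path, subject_text):
    """Return (sender, subject) pairs for messages whose subject contains the text."""
    box = mailbox.mbox(mbox_path)
    try:
        return [(msg["from"], msg["subject"])
                for msg in box
                if msg["subject"] and subject_text.lower() in msg["subject"].lower()]
    finally:
        box.close()
```

From here it is a small step to cross-reference each hit against a web page or a database, as described above.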

Navigating the File System with the os Module

The os module and its submodule os.path are among the most helpful parts of Python for the day-to-day tasks you have to perform on many different systems. If you often write scripts and programs on either Windows or UNIX that must still work on the other operating system, you know from Chapter 8 that Python takes care of much of the work of hiding the differences between the two.

In this chapter, we're going to completely ignore a lot of what the os module can do (ranging from process control to getting system information) and just focus on some of the functions useful for working with files and directories. Some things you've been introduced to already, and others are new.

One of the difficult and annoying points about writing cross-platform scripts is the fact that directory names are separated by backslashes (\) under Windows, but forward slashes (/) under UNIX. Even breaking a full path down into its components is irritatingly complicated if you want your code to work under both operating systems.

Furthermore, Python, like many other programming languages, makes special use of the backslash character to indicate special text, such as \n for a newline. This complicates your scripts that create file paths on Windows.

With Python's os.path module, however, you get some handy functions that will split and join path names for you automatically with the right characters, and they'll work correctly on any OS that Python is running on (including the Mac). You can call a single function to iterate through the directory structure and then call another function of your choosing on each file it finds in the hierarchy. You see a lot of that in the examples that follow, but first look at an overview of some of the useful functions in the os and os.path modules that you'll be using.

The following list gives each function as it is called, followed by a description.

os.getcwd()

Returns the current working directory. You can think of this function as the basic coordinate of the directory functions, whatever language you use them in.

os.listdir(directory)

Returns a list of the names of files and subdirectories stored in the named directory. You can then run os.stat() on the individual files — for example, to determine which are files and which are subdirectories.

os.stat(path)

Returns a tuple of numbers that give you everything you could possibly need to know about a file (or directory). The values are taken from the structure returned by the ANSI C function of the same name, and they have the following meanings (some are dummy values under Windows, but they're in the same places!):

  • st_mode: permissions on the file

  • st_ino: inode number (UNIX)

  • st_dev: device number

  • st_nlink: link number (UNIX)

  • st_uid: userid of owner

  • st_gid: groupid of owner

  • st_size: size of the file

  • st_atime: time of last access

  • st_mtime: time of last modification

  • st_ctime: time of the last metadata change (UNIX) or of creation (Windows)

os.path.split(path)

Splits the path into its component names appropriately for the current operating system. Returns a tuple, not a list. This always surprises me.

os.path.join(component1, component2, ...)

Joins the name components, passed as separate arguments, into a single path appropriate to the current operating system.

os.path.normcase(path)

Normalizes the case of a path. Under UNIX this has no effect, because file names are case-sensitive; but under Windows, where the OS silently ignores case when comparing file names, it's useful to run normcase on both paths before comparing them. That way, if the paths differ only in capitalization, Python will compare them the same way the operating system would and consider them equal. Under Windows, the function returns a path in all lowercase and converts any forward slashes into backslashes.

os.walk(top, topdown=True, onerror=None, followlinks=False)

This is a brilliant function that walks the directory tree rooted at top, either top-down or bottom-up. For each directory, it yields a 3-tuple consisting of dirpath, dirnames, and filenames. The dirpath portion is a string holding the path of the directory. dirnames is a list of the names of the subdirectories in dirpath, excluding '.' and '..'. Lastly, filenames is a list of the names of the non-directory files in dirpath.

There are more functions where those came from, but these are the ones used in the example code that follows. You will likely use these functions far more than any others in these modules. You can find many other useful functions in the Python module documentation for os and os.path.
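A quick interactive check of the path functions (condensed here into a script) makes their behavior concrete:

```python
import os.path

# join builds a path using the separator of the OS the script runs on...
full_path = os.path.join("logs", "2023", "error.log")
# ...and split breaks off the last component, returning a tuple.
head, tail = os.path.split(full_path)
print(head)   # the directory portion, e.g. logs/2023 under UNIX
print(tail)   # prints "error.log"
# normcase only changes anything under Windows, where case is ignored.
print(os.path.normcase("SomeDir/File.TXT"))
```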

Working with Regular Expressions and the re Module

Perhaps the most powerful tool in the text processing toolbox is the regular expression. Though matching on simple strings or substrings is useful, such matches are limited. Regular expressions pack a lot of punch into a few characters, and they're so powerful that it really pays to get to know them. The basic regular expression syntax is used identically in several programming languages, and you can find at least one book written solely on their use, plus thousands of pages in other books (like this one).

As mentioned previously, a regular expression defines a simple parser that matches strings within a text. Regular expressions work essentially the same way as wildcards when you use them to specify multiple files on a command line, in that the wildcard enables you to define a string that matches many different possible file names. In case you didn't know, characters like * and ? are wildcards that, when used with commands such as dir on Windows or ls on UNIX, let you select more than one file, but possibly fewer than every file (as does dir win*, which lists only files in the current directory on Windows whose names start with the letters w, i, and n, followed by anything; that's why the * is called a wildcard). Two major differences exist between a regular expression and a simple wildcard:

  • A regular expression can match multiple times anywhere in a longer string.

  • Regular expressions are much, much more complicated and much richer than simple wildcards, as you will see.

The main thing to note when starting to learn about regular expressions is this: A string always matches itself. Therefore, for instance, the pattern 'xxx' will always match itself in 'abcxxxabc'. Everything else is just icing on the cake; the core of what we're doing is just finding strings in other strings.

You can add special characters to make the patterns match more interesting things. The most commonly used one is the general wildcard '.' (a period or dot). The dot matches any one character in a string; so, for instance, 'x.x' will match the strings 'xxx' or 'xyx' or even 'x.x'.

The last example raises a fundamental point in dealing with regular expressions. What if you really want to find something with a dot in it, like 'x.x'? Actually, specifying 'x.x' as a pattern won't work; it will also match 'x!x' and 'xqx'. Instead, regular expressions enable you to escape special characters by adding a backslash in front of them. Therefore, to match 'x.x' and only 'x.x', you would use the pattern 'x\.x'; the backslash takes away the special meaning of the period, which is now an escaped character.

However, here you run into a problem with Python's normal processing of strings. Python also uses the backslash for escape sequences: '\n' specifies a newline and '\t' is a tab character. To avoid running afoul of this normal processing, regular expressions are usually specified as raw strings, which, as you've seen, is a fancy way of saying that you tack an r onto the front of the string constant, and Python then leaves the backslashes alone.

So after all that verbiage, how do you really match 'x.x'? Simple: You specify the pattern r"x\.x". Fortunately, if you've gotten this far, you've already made it through the hardest part of coming to grips with regular expressions in Python. The rest is easy.

Before you get too far into specifying the many special characters used by regular expressions, first look at the function used to match strings, and then do some learning by example, by typing a few regular expressions right into the interpreter.
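For example, a short session at the interpreter (condensed here into a script) confirms the points above using re.search, which looks for a pattern anywhere in a string:

```python
import re

# A string always matches itself:
print(re.search('xxx', 'abcxxxabc').group())      # prints "xxx"

# The dot matches any single character:
print(re.search('x.x', 'abcxyxabc').group())      # prints "xyx"

# An unescaped dot also matches things you may not want:
print(re.search('x.x', 'abcxqxabc') is not None)  # prints "True"

# Escaping the dot (in a raw string) matches only a literal period:
print(re.search(r'x\.x', 'abcxqxabc'))            # prints "None"
print(re.search(r'x\.x', 'abcx.xabc').group())    # prints "x.x"
```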

Summary

Text processing scripts are generally short, useful, reusable programs, which are either written for one-time and occasional use, or used as components of a larger data-processing system. The chief tools for the text processing programmer are directory structure navigation and regular expressions, both of which were examined in brief in this chapter.

Python is handy for this style of programming because it offers a balance where it is easy to use for simple, one-time tasks, and it's also structured enough to ease the maintenance of code that gets reused over time.

The specific techniques shown in this chapter include the following:

  • Use the os.walk function to traverse the file system.

  • Place the search criteria in a function you write, and apply that function to each file name that os.walk yields.

  • Regular expressions work well to perform the tests on each file found by the os.walk function.

  • Try out regular expressions in the Python interpreter interactively to ensure they work.
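Pulled together, the points above form a pattern along these lines (a sketch; the function name here is ours, not the chapter's):

```python
import os
import re

def find_matching_files(top, pattern):
    """Walk the tree under top; return paths whose file name matches the regex."""
    matcher = re.compile(pattern)
    matches = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if matcher.search(name):
                matches.append(os.path.join(dirpath, name))
    return matches
```

For instance, find_matching_files(".", r"\.pdf$") collects every PDF file below the current directory.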

Chapter 12 covers an important concept: testing. Testing enables you not only to ensure that your scripts work, but that the scripts still work when you make a change.

Exercises

  1. Modify the scan_pdf.py script to start at the root, or topmost, directory. On Windows, this should be the topmost directory of the current disk (C:, D:, and so on); doing this on a network share can be slow, so don't be surprised if a drive that comes from a file server takes a lot more time. On UNIX and Linux, this should be the topmost directory (the root directory, /).

  2. Modify the scan_pdf.py script to match only PDF files with the text boobah in the file name.

  3. Modify the scan_pdf.py script to exclude all files with the text boobah in the file name.
