Chapter 11. Text Processing

There is a whole range of applications for which scripting languages like Python are perfectly suited; in fact, scripting languages were arguably invented specifically for these applications, which involve the simple search and processing of various files in the directory tree. Taken together, these applications are often called text processing. Python is a great tool for writing quick text processing scripts and then scaling them up into more generally useful code later, thanks to its clean object-oriented coding style.

In this chapter you learn:

  • Some of the typical reasons you need text processing scripts

  • A few simple scripts for quick system administration tasks

  • How to navigate around in the directory structure in a platform-independent way, so your scripts will work fine on Linux, Windows, or even the Mac

  • How to create regular expressions to compare the files found by the os and os.path modules

  • How to use successive refinement to keep enhancing your Python scripts to winnow through the data found

Text processing scripts are one of the most useful tools in the toolbox of anybody who seriously works with computer systems, and Python is a great way to do text processing. You're going to like this chapter.

Why Text Processing Is So Useful

In general, the whole idea behind text processing is simply finding things. There are, of course, situations in which data are organized in a structured way; these are called databases and that's not what this chapter is about. Databases carefully index and store data in such a way that if you know what you're looking for, you can retrieve it quickly. However, in some data sources, the information is not at all orderly and neat, such as directory structures with hundreds or thousands of files, or logs of events from system processes consisting of thousands or hundreds of thousands of lines, or even e-mail archives with months of exchanges between people.

When data of that nature needs to be searched for something, or processed in some way, text processing is in its element. Of course, there's no reason not to combine text processing with other data-access methods; you might find yourself writing scripts rather often that run through thousands of lines of log output and do occasional RDBMS lookups (Relational DataBase Management Systems — you learn about these in Chapter 14) on some of the data they run across. This is a natural way to work.

Ultimately, this kind of script can very often get used for years as part of a back-end data processing system. If the script is written in a language like Perl, it can sometimes be quite opaque when some poor soul is assigned five years later to "fix it." Fortunately, this is a book about Python programming, and so the scripts written here can easily be turned into reusable object classes — later, you look at an illustrative example.

The two main tools in your text processing belt are directory navigation, and an arcane technology called regular expressions. Directory navigation is one area in which different operating systems can really wreak havoc on simple programs, because the three major operating system families (UNIX, Windows, and the Mac) all organize their directories differently; and, most painfully, they use different characters to separate subdirectory names. Python is ready for this, though — a series of cross-platform tools are available for the manipulation of directories and paths that, when used consistently, can eliminate this hassle entirely. You saw these in Chapter 8, and you see more uses of these tools here.

A regular expression is a way of specifying a very simple text parser, which then can be applied relatively inexpensively (which means that it will be fast) to any number of lines of text. Regular expressions crop up in a lot of places, and you've likely seen them before. If this is your first exposure to them, however, you'll be pretty pleased with what they can do. In the scope of this chapter, you're just going to scratch the surface of full-scale regular expression power, but even this will give your scripts a lot of functionality.

You first look at some of the reasons you might want to write text processing scripts, and then you do some experimentation with your new knowledge. The most common reasons to use regular expressions include the following:

  • Searching for files

  • Extracting useful data from program logs, such as a web server log

  • Searching through your e-mail

The following sections introduce these uses.

Searching for Files

Searching for files, or doing something with some files, is a mainstay of text processing. For example, suppose that you spent a few months ripping your entire CD collection to MP3 files, without really paying attention to how you were organizing the hundreds of files you were tossing into some arbitrarily made-up set of directories. That wasn't a problem until a couple of months later, when you finally thought about organizing your files into directories by artist and realized that the directory structure you had ended up with was hopelessly confused.

Text processing to the rescue! Write a Python script that scans the hopelessly nonsensical directory structure, divides each file name into parts that might be an artist's name, and then tries to look each potential name up in a music database. The result is that you could rearrange hundreds of files into directories by, if not the name of the artist, certainly some pretty good guesses, which would get you close to having a sensible structure. From there, you would be able to explore manually and end up with an actually organized music library.
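A sketch of such a script might look like the following. The "Artist - Title.mp3" naming convention is an assumption made for illustration, and the music-database lookup is left out:

```python
import os

def guess_artist(filename):
    """Guess an artist from a name like 'Artist - Title.mp3' (an assumed convention)."""
    base = os.path.splitext(filename)[0]
    return base.split("-")[0].strip()

def collect_guesses(music_root):
    """Walk music_root and group MP3 paths by the guessed artist name."""
    guesses = {}
    for dirpath, dirnames, filenames in os.walk(music_root):
        for name in filenames:
            if not name.lower().endswith(".mp3"):
                continue
            artist = guess_artist(name)
            guesses.setdefault(artist, []).append(os.path.join(dirpath, name))
    return guesses
```

Each guessed artist name could then be checked against a music database before any files are actually moved.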

This is a one-time use of a text processing script, but you can easily imagine other scenarios in which you might use a similarly useful script on a regular basis, such as when you are handling data from a client or from a data source that you don't control. Of course, if you need to do this kind of sorting often, you can easily use Python to come up with some organized tool classes that perform these tasks to avoid having to duplicate your effort each time.

Whenever you face a task like this, a task that requires a lot of manual work manipulating data on your computer, think Python. Writing a script or two could save you hours and hours of tedious work.

A second, similar situation is a fallout of today's large hard disks. Many users store files willy-nilly on their hard disk but never seem to have the time to organize them. A worse situation occurs when you face a hard disk full of files and need to extract some information you know is there on your computer, but you're not sure exactly where. You are not alone. Apple, Google, Microsoft, and others all offer desktop search tools that help you sift through the data in the files you have collected and extract useful information.

Think of Python as a desktop search on steroids, because you can create scripts with a much finer control over the search, as well as perform operations on the files found.
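As a tiny illustration of that finer control, here is a sketch that greps a single file for a word, case-insensitively, and reports line numbers; a fuller script would combine it with os.walk to cover a whole directory tree:

```python
def search_file(path, needle):
    """Return (line_number, line) pairs for lines in path containing needle."""
    hits = []
    # errors="replace" keeps the scan going even over oddly encoded files
    with open(path, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            if needle.lower() in line.lower():
                hits.append((lineno, line.rstrip("\n")))
    return hits
```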

Clipping Logs

Another common text-processing task that comes up in system administration is the need to sift through log files for various information. Scripts that filter logs can be spur-of-the-moment affairs meant to answer specific questions (such as "When did that e-mail get sent?" or "When was the last time my program logged one specific message?"), or they might be permanent parts of a data processing system that evolves over time to manage ongoing tasks. These could be a part of a system administration and performance-monitoring system, for instance. Scripts that regularly filter logs for particular subsets of the information are often said to be clipping logs — the idea being that, just as you clip polygons to fit on the screen, you can also clip logs to fit into whatever view of the system you need.

However you decide to use them, after you gain some basic familiarity with the techniques used, these scripts become almost second nature. This is an application where regular expressions are used a lot, for two reasons: First, it's very common to use a UNIX shell command like grep to do first-level log clipping; second, if you do it in Python, you'll probably be using regular expressions to split the line into usable fields before doing more work with it. In any one clipping task, you may very well be using both techniques.
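To give a taste of the second technique, here is a sketch that uses a regular expression to split a log line into usable fields. The syslog-style line format assumed here is purely illustrative:

```python
import re

# Assumed line shape: "Jan 12 03:04:05 host process: message"
LOG_PATTERN = re.compile(r"(\w{3} +\d+ [\d:]+) (\S+) (\S+?): (.*)")

def clip_line(line):
    """Split one log line into (timestamp, host, process, message), or None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return match.groups()
```

In a real clipping script, a loop would feed every line of the log file through clip_line and keep only the tuples of interest.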

After a short introduction to traversing the file system and creating regular expressions, you look at a couple of scripts for text processing in the following sections.

Sifting through Mail

The final text processing task is one that you've probably found useful (or if you haven't, you've badly wanted it): the processing of mailbox files to find something that can't be found by your normal Inbox search feature. The most common reason you need something more powerful for this is that the mailbox file is either archived, so that you can access the file, but not read it with your mail reader easily, or it has been saved on a server where you've got no working mail client installed. Rather than go through the hassle of moving it into your Inbox tree and treating it like an active folder, you might find it simpler just to write a script to scan it for whatever you need.

However, you can also easily imagine a situation in which your search script might want to get data from an outside source, such as a web page or perhaps some other data source, like a database (see Chapter 14 for more about databases), to cross-reference your data, or do some other task during the search that can't be done with a plain vanilla mail client. In that case, text processing combined with any other technique can be an incredibly useful way to find information that may not be easy to find any other way.
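Python's standard library even ships a mailbox module for exactly this job. A sketch for scanning an mbox-format archive (the file path would be whatever your mail system uses) might look like this:

```python
import mailbox

def find_messages(mbox_path, subject_text):
    """Return (sender, subject) pairs for messages whose subject contains the text."""
    box = mailbox.mbox(mbox_path)
    try:
        return [(msg["from"], msg["subject"])
                for msg in box
                if msg["subject"] and subject_text.lower() in msg["subject"].lower()]
    finally:
        box.close()
```

From here it is a small step to cross-reference each hit against a web page or a database, as described above.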

Navigating the File System with the os Module

The os module and its submodule os.path are among the most helpful parts of Python for the day-to-day tasks you have to perform on many different systems. If you often write scripts and programs on either Windows or UNIX that must still work on the other operating system, you know from Chapter 8 that Python takes care of much of the work of hiding the differences between the two.

In this chapter, we're going to completely ignore a lot of what the os module can do (ranging from process control to getting system information) and just focus on some of the functions useful for working with files and directories. Some things you've been introduced to already, and others are new.

One of the difficult and annoying points about writing cross-platform scripts is the fact that directory names are separated by backslashes (\) under Windows, but forward slashes (/) under UNIX. Even breaking a full path down into its components is irritatingly complicated if you want your code to work under both operating systems.

Furthermore, Python, like many other programming languages, makes special use of the backslash character to indicate special text, such as \n for a newline. This complicates your scripts that create file paths on Windows.

With Python's os.path module, however, you get some handy functions that will split and join path names for you automatically with the right characters, and they'll work correctly on any OS that Python is running on (including the Mac). You can call a single function to iterate through the directory structure and then call another function of your choosing on each file it finds in the hierarchy. You see a lot of that in the examples that follow, but first look at an overview of some of the useful functions in the os and os.path modules that you'll be using.

The following list gives each function as it is called, followed by a description.

os.getcwd()

Returns the current working directory. You can think of this function as the basic coordinate of the directory functions, whatever language you use them in.

os.listdir(directory)

Returns a list of the names of files and subdirectories stored in the named directory. You can then run os.stat() on the individual files — for example, to determine which are files and which are subdirectories.

os.stat(path)

Returns a tuple of numbers that give you everything you could possibly need to know about a file (or directory). The values are taken from the structure returned by the ANSI C function of the same name, and they have the following meanings (some are dummy values under Windows, but they're in the same places!):

  • st_mode: permissions on the file

  • st_ino: inode number (UNIX)

  • st_dev: device number

  • st_nlink: link number (UNIX)

  • st_uid: userid of owner

  • st_gid: groupid of owner

  • st_size: size of the file

  • st_atime: time of last access

  • st_mtime: time of last modification

  • st_ctime: time of the last metadata change (UNIX) or of creation (Windows)

os.path.split(path)

Splits the path into its component names appropriately for the current operating system. Returns a tuple, not a list. This always surprises me.

os.path.join(component1, component2, ...)

Joins the name components, passed as separate arguments, into a single path appropriate to the current operating system.

os.path.normcase(path)

Normalizes the case of a path. Under UNIX this has no effect, because file names are case-sensitive; but under Windows, where the OS silently ignores case when comparing file names, it's useful to run normcase on both paths before comparing them. That way, if the paths differ only in capitalization, Python will compare them the same way the operating system would and consider them equal. Under Windows, the function returns a path in all lowercase and converts any forward slashes into backslashes.

os.walk(top, topdown=True, onerror=None, followlinks=False)

This is a brilliant function that walks the directory tree rooted at top, either top-down or bottom-up. For each directory, it yields a 3-tuple consisting of dirpath, dirnames, and filenames. The dirpath portion is a string holding the path of the directory. dirnames is a list of the names of the subdirectories in dirpath, excluding '.' and '..'. Lastly, filenames is a list of the names of the non-directory files in dirpath.

There are more functions where those came from, but these are the ones used in the example code that follows. You will likely use these functions far more than any others in these modules. You can find many other useful functions in the Python module documentation for os and os.path.
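A quick interactive check of the path functions (condensed here into a script) makes their behavior concrete:

```python
import os.path

# join builds a path using the separator of the OS the script runs on...
full_path = os.path.join("logs", "2023", "error.log")
# ...and split breaks off the last component, returning a tuple.
head, tail = os.path.split(full_path)
print(head)   # the directory portion, e.g. logs/2023 under UNIX
print(tail)   # prints "error.log"
# normcase only changes anything under Windows, where case is ignored.
print(os.path.normcase("SomeDir/File.TXT"))
```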

Working with Regular Expressions and the re Module

Perhaps the most powerful tool in the text processing toolbox is the regular expression. Though matching on simple strings or substrings is useful, such matches are limited. Regular expressions pack a lot of punch into a few characters, and they're so powerful that it really pays to get to know them. The basic regular expression syntax is used identically in several programming languages, and you can find at least one book written solely on their use, plus thousands of pages in other books (like this one).

As mentioned previously, a regular expression defines a simple parser that matches strings within a text. Regular expressions work essentially the same way as wildcards when you use them to specify multiple files on a command line, in that the wildcard enables you to define a string that matches many different possible file names. In case you didn't know, characters like * and ? are wildcards that, when used with commands such as dir on Windows or ls on UNIX, let you select more than one file, but possibly fewer than every file (as does dir win*, which lists only files in the current directory on Windows whose names start with the letters w, i, and n, followed by anything; that's why the * is called a wildcard). Two major differences exist between a regular expression and a simple wildcard:

  • A regular expression can match multiple times anywhere in a longer string.

  • Regular expressions are much, much more complicated and much richer than simple wildcards, as you will see.

The main thing to note when starting to learn about regular expressions is this: A string always matches itself. Therefore, for instance, the pattern 'xxx' will always match itself in 'abcxxxabc'. Everything else is just icing on the cake; the core of what we're doing is just finding strings in other strings.

You can add special characters to make the patterns match more interesting things. The most commonly used one is the general wildcard '.' (a period or dot). The dot matches any one character in a string; so, for instance, 'x.x' will match the strings 'xxx' or 'xyx' or even 'x.x'.

The last example raises a fundamental point in dealing with regular expressions. What if you really want to find something with a dot in it, like 'x.x'? Actually, specifying 'x.x' as a pattern won't work; it will also match 'x!x' and 'xqx'. Instead, regular expressions enable you to escape special characters by adding a backslash in front of them. Therefore, to match 'x.x' and only 'x.x', you would use the pattern 'x\.x'; the backslash takes away the special meaning of the period, which is now an escaped character.

However, here you run into a problem with Python's normal processing of strings. Python also uses the backslash for escape sequences: '\n' specifies a newline and '\t' is a tab character. To avoid running afoul of this normal processing, regular expressions are usually specified as raw strings, which, as you've seen, is a fancy way of saying that you tack an r onto the front of the string constant, and Python then leaves the backslashes alone.

So after all that verbiage, how do you really match 'x.x'? Simple: You specify the pattern r"x\.x". Fortunately, if you've gotten this far, you've already made it through the hardest part of coming to grips with regular expressions in Python. The rest is easy.

Before you get too far into specifying the many special characters used by regular expressions, first look at the function used to match strings, and then do some learning by example, by typing a few regular expressions right into the interpreter.
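For example, a short session at the interpreter (condensed here into a script) confirms the points above using re.search, which looks for a pattern anywhere in a string:

```python
import re

# A string always matches itself:
print(re.search('xxx', 'abcxxxabc').group())      # prints "xxx"

# The dot matches any single character:
print(re.search('x.x', 'abcxyxabc').group())      # prints "xyx"

# An unescaped dot also matches things you may not want:
print(re.search('x.x', 'abcxqxabc') is not None)  # prints "True"

# Escaping the dot (in a raw string) matches only a literal period:
print(re.search(r'x\.x', 'abcxqxabc'))            # prints "None"
print(re.search(r'x\.x', 'abcx.xabc').group())    # prints "x.x"
```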

Summary

Text processing scripts are generally short, useful, reusable programs, which are either written for one-time and occasional use, or used as components of a larger data-processing system. The chief tools for the text processing programmer are directory structure navigation and regular expressions, both of which were examined in brief in this chapter.

Python is handy for this style of programming because it offers a balance where it is easy to use for simple, one-time tasks, and it's also structured enough to ease the maintenance of code that gets reused over time.

The specific techniques shown in this chapter include the following:

  • Use the os.walk function to traverse the file system.

  • Place the search criteria in a function you write, and apply that function to each file name that os.walk yields.

  • Regular expressions work well to perform the tests on each file found by the os.walk function.

  • Try out regular expressions in the Python interpreter interactively to ensure they work.
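Pulled together, the points above form a pattern along these lines (a sketch; the function name here is ours, not the chapter's):

```python
import os
import re

def find_matching_files(top, pattern):
    """Walk the tree under top; return paths whose file name matches the regex."""
    matcher = re.compile(pattern)
    matches = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if matcher.search(name):
                matches.append(os.path.join(dirpath, name))
    return matches
```

For instance, find_matching_files(".", r"\.pdf$") collects every PDF file below the current directory.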

Chapter 12 covers an important concept: testing. Testing enables you not only to ensure that your scripts work, but that the scripts still work when you make a change.

Exercises

  1. Modify the scan_pdf.py script to start at the root, or topmost, directory. On Windows, this should be the topmost directory of the current disk (C:, D:, and so on); doing this on a network share can be slow, so don't be surprised if a drive that comes from a file server takes a lot more time. On UNIX and Linux, this should be the topmost directory (the root directory, /).

  2. Modify the scan_pdf.py script to match only PDF files with the text boobah in the file name.

  3. Modify the scan_pdf.py script to exclude all files with the text boobah in the file name.
