CHAPTER 2: Patterns and Regular Expressions

This chapter is a bit of a digression; if you are comfortable with patterns and regular expressions, you can just skip ahead to Chapter 3, where I begin the discussion of shell syntax. However, if you are unfamiliar with patterns and regular expressions, this material turns out to be very important for understanding and illustrating the coming examples. Furthermore, you will have to learn it to be an effective shell programmer, so if you haven't learned it before, start early.

Shell programming is heavily dependent on string processing. The term string is used generically to refer to any sequence of characters; typical examples of strings might be a line of input or a single argument to a command. Users enter responses to prompts, file names are generated, and commands produce output. Recurring throughout this is the need to determine whether a given string conforms to a given pattern; this process is called pattern matching. The shell has a fair amount of built-in pattern matching functionality (especially if you are comfortable with relying on POSIX shell features). Pattern matching is not unique to the shell; other programs, such as find, use the same pattern-matching rules. A special variant of shell pattern matching, called globbing, is used to expand file name patterns into groups of matching names. The distinction between globbing and pattern matching is a bit vague; many people call all patterns globs and use the term file globbing for the special case of matching file names. The shell manual pages, however, tend to call pathname expansion globbing.

Furthermore, many common UNIX utilities, such as grep or sed, provide features for pattern matching. These programs usually use a more powerful kind of pattern matching, called regular expressions. Regular expressions, while different from shell patterns, are crucial to most effective shell scripting. While there is no portable regular expression support built into the shell itself, shell programs rely heavily on external utilities, many of which use regular expressions.

Shell Patterns

Shell patterns are used in a number of contexts. The most common usage is in the case statement (see Chapter 3 for more information). Given two shell variables string and pattern, the following code determines whether string matches pattern:

case $string in
  $pattern) echo "Match" ;;
  *) echo "No match" ;;
esac

If $string matches $pattern, the shell echoes "Match" and leaves the case statement. Otherwise, it checks to see whether $string matches *. Since * matches anything in a shell pattern, the shell prints "No match" when there was not a match against $pattern. (The case statement only executes one branch, even if more than one pattern matches.)

For exploring pattern matching, you might find it useful to create a shell script based on this. The following self-contained script performs matching tests of a number of words against a pattern:

#!/bin/sh
pattern="$1"
shift
echo "Matching against '$pattern':"
for string
do
  case $string in
  $pattern) echo "$string: Match." ;;
  *) echo "$string: No match." ;;
  esac
done

Save this script to a file named pattern, make it executable (chmod a+x pattern), and you can use it to perform your own tests:

$ ./pattern '*' 'hello'
Matching against '*':
hello: Match.
$ ./pattern 'hello*' 'hello' 'hello, there' 'well, hello'
Matching against 'hello*':
hello: Match.
hello, there: Match.
well, hello: No match.

Remember to use single quotes around the arguments. An unquoted word containing pattern characters such as the asterisk (*) is subject to globbing (sometimes called file name expansion), where the shell replaces such words with any files with names matching the pattern. This can produce misleading results for tests like this. File name patterns are discussed in more detail in the next section.

Pattern-Matching Basics

In a pattern, most characters match themselves, and only themselves. The word hello is a perfectly valid pattern; it matches the word hello, and nothing else. A pattern that matches only part of a string is not considered to have matched that string. The word hello does not match the text hello, world. For a pattern to match a string, two things must be true:

Every character in the pattern must match the string.
Every character in the string must match the pattern.

Now, if this were all there were to patterns, a pattern would be another way of describing string comparison, and the rest of this chapter would consist of filler text like "a ... consists of sequences of nonblank characters separated by blanks," or possibly some wonderful cookie recipes. Sadly, this is not so. Instead, there are some characters in a pattern that have special meaning and can match something other than themselves. Characters that have special meaning in a pattern are called wildcards or metacharacters. Some users prefer to restrict the term wildcard to refer only to the special characters that can match anything. In talking about patterns, I prefer to call them all wildcards to avoid confusion with characters that have special meaning to the shell. Wildcards make those two simple rules much more complicated; a single character in a pattern could match a very long string, or a group of characters in the pattern might match only one character or even none at all. What matters is that there are no mismatches and nothing left over of the string after the match.

The most common wildcards are the question mark (?), which matches any single character, and the asterisk (*), which matches anything at all, even an empty string. (If this sounds very wrong, and you think they modify previous characters, you are thinking of regular expressions. Regular expressions, discussed in detail in the "Regular Expressions" section of this chapter, are much more expressive and somewhat more complicated.)

The ? is easy to use in patterns; you use it when you know there will be exactly one character, but you are not sure exactly what it will be. For instance, if you are not sure what accent the user will greet you in, you might use the pattern h?llo, in case your user prefers to write hallo, or hullo. This leaves you with two problems. The first is that users are typically verbose, and write things like hello, there, or hello little computer, or possibly even hello how do i send email. If you just want to verify that you are getting something that sounds a bit like a greeting, you need a way to say "this, or this plus any other stuff on the end."

That is what * is for. Because * matches anything, the pattern hello* matches anything starting with hello, or even just hello with nothing after it. However, that pattern doesn't match the string well, hello because there is nothing in the pattern that can match characters before the word hello. A common idiom when you want to match a word if it is present at all is to use asterisks on both sides of a pattern: *hello* matches a broad range of greetings.
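As a minimal sketch of that idiom, the following wraps the *hello* pattern in a small function. The name is_greeting is my own invention for illustration; it is not a standard command.

```sh
#!/bin/sh
# is_greeting is a hypothetical helper, not part of any standard toolset.
# It succeeds if the word "hello" appears anywhere in its argument.
is_greeting() {
  case $1 in
    *hello*) return 0 ;;
    *)       return 1 ;;
  esac
}

for s in 'hello' 'hello, there' 'well, hello' 'goodbye'
do
  if is_greeting "$s"; then
    echo "$s: greeting"
  else
    echo "$s: not a greeting"
  fi
done
```

The first three strings are reported as greetings, including well, hello, which the pattern hello* alone would have rejected.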

If you want to match something, but you are not sure what it is or how long it will be, you can combine these. The pattern hello ?* matches hello world but does not match hello alone. However, this pattern introduces a new problem. The space character is not special in a pattern, but it is special in the shell. This leads to a bit of a dilemma. If you do not quote the pattern, the shell splits it into multiple words, and it does not match what you expected. If you do quote it, the shell ignores the wildcards. There are two solutions available; the first is to quote spaces, the second is to unquote wildcards. So, you could write hello" "?*, or you could write "hello "?*.

In the contexts where the shell performs pattern matching (such as case statements), you do not need to worry about spaces resulting from variable substitution; the shell doesn't perform splitting on variable substitutions in those contexts. (A disclaimer is in order: zsh's behavior differs here, unless it is running in sh emulation mode. See Chapter 7 for more information.)
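The quoting rules above can be sketched as follows. The function names classify and matches are hypothetical, chosen only for this example, and the behavior shown assumes a POSIX-style shell (not zsh in its native mode).

```sh
#!/bin/sh
# classify is a hypothetical helper showing how quoting changes a pattern.
classify() {
  case $1 in
    "hello ?*")  echo "literal" ;;    # fully quoted: ? and * are ordinary characters
    hello" "?*)  echo "greeting" ;;   # only the space is quoted; wildcards stay active
    *)           echo "no match" ;;
  esac
}

classify 'hello world'   # greeting
classify 'hello ?*'      # literal
classify 'hello'         # no match
classify 'hello '        # no match: the ? needs one more character

# matches is a hypothetical helper: matches STRING PATTERN.
# The unquoted $2 is not split at its space in this context.
matches() {
  case $1 in
    $2) return 0 ;;
    *)  return 1 ;;
  esac
}

matches 'hello world' 'hello ?*' && echo "pattern from a variable matched"
```

The second function shows the point about variable substitution: the space inside the pattern variable causes no trouble in a case statement.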

Character Classes

The h?llo pattern has another flaw, which is that it is too permissive. While your friends who type with a thick accent will doubtless appreciate your consideration, you might reasonably draw the line at hzllo, h!llo, or hXllo. The shell provides a mechanism for more restrictive matches, called a character class. A character class matches any one of a set of characters, but nothing else; it is like ?, only more restrictive. A character class is enclosed in square brackets ([]), and looks like [characters]. The greeting described previously could be written using a character class as h[aeu]llo. A character class matches exactly one of the characters in it; it never matches more than one character.

Character classes may specify ranges of characters. A typical usage would be to match any digit, with [0-9]. In a range, two characters separated by a hyphen are treated as every character between them in the character set; mostly, this is used for letters and numbers. Patterns are case sensitive; if you want to match all standard ASCII letters, use [a-zA-Z]. The behavior of a range where the second character comes before the first in the character set is not predictable; do not do that. Sometimes, rather than knowing what you do want, you know what you don't want; you can invert a character class by using an exclamation mark (!) as its first character. The character class [!0-9] matches any character that is not a digit. When a character class is inverted, it matches any character not in the class, not just any reasonable or common character; if you write [!aeiou] hoping to get consonants, you will also match punctuation or control characters. Wildcards do not have special meaning in a character class; [?*] matches a question mark or an asterisk, but not anything else.
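A short sketch may make these rules easier to try out. The function name describe_char is hypothetical, and the letter branch assumes an ASCII-like character set.

```sh
#!/bin/sh
# describe_char is a hypothetical helper that reports which class,
# if any, matches a one-character argument. Branches are tried in order.
describe_char() {
  case $1 in
    [0-9])     echo "digit" ;;
    [a-zA-Z])  echo "letter" ;;
    [?*])      echo "literal wildcard" ;;    # ? and * are not special inside a class
    [!0-9])    echo "other non-digit" ;;     # inverted class: anything but a digit
    *)         echo "not a single character" ;;
  esac
}

describe_char 7      # digit
describe_char q      # letter
describe_char '*'    # literal wildcard
describe_char '!'    # other non-digit
describe_char ab     # not a single character
```

Note that the inverted class catches the exclamation mark only because the earlier branches did not; on its own, [!0-9] would also match q and *.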

Character classes are one of the most complicated aspects of shell pattern matching. Left and right square brackets ([]), hyphens (-), and exclamation marks (!) are all special to them. A hyphen can easily be included in a class by specifying it as the last character of the class, with no following character. An exclamation mark can be included by specifying it as any character but the first. (What if there are no other characters? Then you are specifying only one character and probably don't need a character class.) The left bracket is actually easy; include it anywhere, it won't matter. The right bracket (]) is special; if you want a right bracket, put it either at the very beginning of the list or immediately after the ! for a negated class. Otherwise, the shell might think that the right bracket was intended to close the character class. Even apart from the intended feature set, be aware that some shells have plain and simple bugs having to do with right brackets in character classes; avoid them if you can.

If you want to match any left or right bracket, exclamation mark, or hyphen, but no other characters, here is a way to do it:

[][!-]

The first left bracket begins the definition of the class. The first right bracket does not close the class because there is nothing in it yet; it is taken as a plain literal right bracket. The second left bracket and the exclamation mark have no special meaning; neither is in a position where it would have any. Finally, the hyphen is not between two other characters in the class because the right square bracket ends the definition of the character class, so the hyphen must be a plain character.
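To see that class in action, here is a minimal sketch. The function name is_class_special is hypothetical, and given the warning about right-bracket bugs, results in an unusual shell may differ.

```sh
#!/bin/sh
# is_class_special is a hypothetical helper. It succeeds only for a
# single left bracket, right bracket, exclamation mark, or hyphen.
is_class_special() {
  case $1 in
    [][!-]) return 0 ;;
    *)      return 1 ;;
  esac
}

for c in '[' ']' '!' '-' 'a' '^'
do
  if is_class_special "$c"; then
    echo "$c: in the class"
  else
    echo "$c: not in the class"
  fi
done
```

In a shell that follows the rules described here, the first four characters are reported as being in the class, and the letter and the caret are not.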

Many users have the habit of using a caret (^) instead of ! in shell character classes. This is not portable, but it is a common extension some shells offer because habitual users of regular expressions may be more used to it. This can create an occasional surprise if you have never seen it used, and want to match a caret in a class.

Table 2-1 explains the behavior of a number of characters that may have special meaning within a character class, as well as how to include them literally in a class when you want to.

Table 2-1. Special Characters in Character Classes

Character	Meaning	Portability	How to Include It
`]`	End of class	Universal	Put at the beginning of the class (or first after the negation character)
`[`	Beginning of class	Universal	Put it anywhere in the class
`^`	Inversion	Common	Put after some other character
`!`	Inversion	Universal	Put after some other character
`-`	Range	Universal	Put at the beginning or end of the class

Ranges have an additional portability problem that is often overlooked, especially by English speakers. There is no guarantee that the range [a-z] matches every lowercase letter, and strictly speaking there is not even a guarantee that it matches only lowercase letters. The problem is that most people assume the ASCII character set, which defines only unaccented characters. In ASCII, the uppercase letters are contiguous, and the lowercase letters are also contiguous (but there are characters between them; [A-z] matches a few punctuation characters). However, there are UNIX-like systems on which either or both of these assumptions may be wrong. In practice, it is very nearly portable to assume that [a-z] matches 26 lowercase letters. However, accented variants of lowercase letters do not match this pattern. There is no generally portable way to match additional characters, or even to find out what they are. Scripts may be run in different environments with different character sets.

Some shells also support additional character class notations; these were introduced by POSIX but so far are rare outside of ksh (not pdksh) and bash. The notation is [[:class:]], where class is a word like digit, alpha, or punct. This matches any character for which the corresponding C isclass() function (isdigit(), isalpha(), ispunct(), and so on) would return true. For example, [[:digit:]] is equivalent to [0-9]. These classes may be combined with other characters; [[:digit:][:alpha:]_] matches any letter or number or an underscore (_). Additional similar rules use [.name.] to match a special collating symbol (for instance, some languages might have a special rule for matching and sorting certain combinations of letters, so a ch might sort differently from a c followed by an h) and [=name=] to match equivalence classes, such as a lowercase letter and any accented variant of it. These rules are particularly useful for internationalized scripts but not sufficiently widely available to be used in portable scripts yet. To avoid any possible misunderstandings, avoid using a left bracket followed immediately by a period (.), equals sign (=), or colon (:) in a character class. Note that this applies only to a left bracket within the character class, not the initial bracket that opens the class; [.] matches a period. (This is more significant in regular expressions, where a period would otherwise have special meaning.)
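For readers whose shell does support the POSIX class notation, the combined class from the paragraph above can be tried with a sketch like this one. The function name is_word_char is hypothetical, and the example assumes a shell that implements [[:class:]]; in a shell that does not, the results will differ.

```sh
#!/bin/sh
# is_word_char is a hypothetical helper. It assumes the shell supports
# POSIX [[:class:]] notation, which the text notes is not universal.
is_word_char() {
  case $1 in
    [[:digit:][:alpha:]_]) return 0 ;;   # one letter, digit, or underscore
    *)                     return 1 ;;
  esac
}

for c in a 7 _ - ab
do
  if is_word_char "$c"; then
    echo "$c: word character"
  else
    echo "$c: something else"
  fi
done
```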

Character classes are, as you can see, substantially more complicated than the rest of the shell pattern matching rules. Table 2-2 shows the full set.

Table 2-2. Shell Pattern Characters

Pattern	Meaning
`?`	Any character
`*`	Any string (even an empty one)
`[...]`	One character from a class
Anything else	Itself

Using Shell Patterns

Shell patterns are quite powerful, but they have a number of limitations. There is no way to specify repetition of a character class; no shell pattern matches an arbitrary number of digits. You can't make part of a pattern optional; the closest you get to optional components is the asterisk.
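One common workaround for the missing repetition, offered here as a sketch rather than a complete solution, is to turn the question around: instead of asking whether every character is a digit, ask whether any character is not. The function name is_number is hypothetical.

```sh
#!/bin/sh
# is_number is a hypothetical helper. It succeeds when its argument is a
# nonempty string made up entirely of the characters 0 through 9.
is_number() {
  case $1 in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains at least one non-digit
    *)           return 0 ;;
  esac
}

for s in 12345 12a45 ''
do
  if is_number "$s"; then
    echo "'$s': all digits"
  else
    echo "'$s': not a number"
  fi
done
```

This handles only unsigned strings of digits; signs, decimal points, and the like would need additional patterns.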

Patterns as a whole generally match as much as they can; this is called being greedy. However, if matching too many things with an asterisk prevents a match, the asterisk gives up the extra characters and lets other pattern components match them. If you match the pattern b* to the string banana, the * matches the text anana. However, if you use the pattern b*na, the * matches only the text ana. The rule is that the * grabs the largest number of characters it can without preventing a match. Other pattern components, such as character classes, literal characters, or question marks, get first priority on consuming characters, and the asterisk gets what's left.

Some of the limitations of shell patterns can be overcome by creative usage. One way to store lists of items in the shell is to have multiple items joined with a delimiter; for instance, you might store the value a,b,c to represent a list of three items. The following example code illustrates how such a list might be used. (The case statement, used here, executes code when a pattern matches a given string; it is explained in more detail in Chapter 3.)

list=orange,apple,banana
case $list in
*apple*)        echo "How do you like them apples?";;
esac

How do you like them apples?

This script has a subtle bug, however. It does not check for exact matches. If you try to check against a slightly different list, the problem becomes obvious:

list=orange,crabapple,banana
case $list in
*apple*)        echo "How do you like them apples?";;
esac

How do you like them apples?

The problem is that the asterisks can match anything, even the commas used as delimiters. However, if you add the delimiters to the pattern, you can no longer match the ends of the list:

list=orange,apple,banana
case $list in
*,orange,*)        echo "The only fruit for which there is no Cockney slang.";;
esac

[no output]

To resolve this, wrap the list in an extra set of delimiters when expanding it:

list=orange,apple,banana
case ,$list, in
*,orange,*)        echo "The only fruit for which there is no Cockney slang.";;
esac

The only fruit for which there is no Cockney slang.

The expansion of $list now has a comma appended to each end, ensuring that every member of the list has a comma on both sides of it.
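The same technique can be packaged as a reusable function. This is a minimal sketch; the name list_contains is hypothetical, and it assumes that items never contain a comma.

```sh
#!/bin/sh
# list_contains is a hypothetical helper: list_contains LIST ITEM.
# It wraps the list in delimiters so the first and last items can match.
# Quoting "$2" makes the item literal, so wildcards in it are not special.
list_contains() {
  case ,$1, in
    *,"$2",*) return 0 ;;
    *)        return 1 ;;
  esac
}

list=orange,crabapple,banana
list_contains "$list" orange && echo "orange is in the list"
list_contains "$list" banana && echo "banana is in the list"
list_contains "$list" apple  || echo "apple is not in the list"
```

All three messages are printed: the items at the ends of the list match, and crabapple is no longer mistaken for apple.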

Sometimes, you may find that shell patterns do not have the flexibility to represent what you want. When that happens, you may need to go to regular expressions; see the "Regular Expressions" section at the end of this chapter for more information.

Pathname Expansion

Pathname expansion (the POSIX term), or globbing (what everyone actually calls it), is one of the shell features most users are likely to be at least partially familiar with. The shell has a built-in facility for generating or matching file names. When an unquoted word contains any of the pattern-matching wildcards, it is subject to globbing. In globbing, the shell compares the pattern to files in the file system (using essentially the same pattern matching rules described previously) and expands the word into any matching file names. If there are no matches, the shell leaves the pattern alone. Instead of matching a single specified word against a pattern to produce a single true/false result, globbing matches multiple names and produces all the matches as results. There is, of course, an exception; the find utility uses globbing patterns to match file names but uses them for true/false matches.

Differences from Shell Patterns

Pathname expansion uses the same basic pattern-matching characters as regular shell patterns, but there are a couple of significant differences. When a pathname refers to a file not in the current directory, the full name used is called the path of the file. Each of the pieces of a path, separated by slashes (or possibly by other characters on non-UNIX systems), is called a component. In globbing, each section of a pattern (as divided by path separators) is matched against single components. So, if you wish to match the file bin/unsort, you can specify b*/unsort, or b*/u*, or bin/*sort, but you cannot just use *unsort. If there are no path separators in a pattern, it matches against files in the current directory; if you are in the bin directory, *sort could match unsort. (Note that there is no portable unsort utility, but writing one makes a great exercise.)

Another way to think about this is that the special characters can never match a path separator; only a literal path separator can match a path separator in a file path. For example, bin[/]unsort does not match bin/unsort. The character class can only match path components, never a path separator. To search in directories with a pattern, you must explicitly include any path separators you wish to match.
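The component-by-component rule can be checked in a scratch directory. This sketch assumes mktemp -d is available (it is common, but not required by POSIX) and that the shell uses its default globbing settings; the variable names are my own.

```sh
#!/bin/sh
# Scratch directory for the demonstration. mktemp -d is an assumption:
# it is widely available but not required by POSIX.
dir=$(mktemp -d) || exit 1
mkdir "$dir/bin" || exit 1
: > "$dir/bin/unsort"
cd "$dir" || exit 1

both=$(echo b*/u*)        # a wildcard in each component
last=$(echo bin/*sort)    # literal directory, wildcard file name
none=$(echo *unsort)      # no separator: only the current directory is searched

cd / && rm -rf "$dir"

echo "b*/u*     -> $both"
echo "bin/*sort -> $last"
echo "*unsort   -> $none"
```

The first two patterns expand to bin/unsort. The third finds nothing in the current directory, so it is left as *unsort.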

If a path starts with a path separator, the path is called an absolute path. Otherwise, it is called a relative path. A relative path name is always interpreted relative to your current directory. In fact, even a file name with no separators is technically a relative path; it is just a very short relative path.

The decision to match only within specified directories may seem surprising, but it makes good sense. Given that a typical UNIX system can easily have hundreds of thousands of files, it is quite simply impractical to try to match against all of them; the desktop system on which I ran most of my test scripts has a bit over three and a half million files on it. The requirement to match directories explicitly is probably a good idea. (The zsh shell, however, offers globbing extensions to let you do crazy things like this if you want. They are not generally portable, though.)

Pathname expansion, like pattern expansion, is aggressive about trying to find a match. Many UNIX systems sort some binaries into both /usr/bin and /usr/sbin. Sometimes it is not obvious which directory a program would be in. While the idiomatic solution is to use which file to find a copy of a file in your execution path, this doesn't help if you've forgotten the exact name of the utility. The glob pattern /usr/*bin/*stat matches any file in either /usr/bin or /usr/sbin with a name ending in stat. When expanding each component, the shell makes a list of possible matches, then compares all of these to the next component. If one of the components never ends up producing any matches, it is discarded completely. There is one subtle difference, having to do with components, between globbing and pattern matching. In a UNIX path, // is always equivalent to /; however, a shell pattern like a/*/b does not match a/b. You cannot match an empty component with a pattern because there is never actually an empty component.

Wildcards never match a component with a name starting with a period (.). These files, called dot files, are not matched by patterns and are usually not displayed to the user; they are often called hidden files. This is not the same as the way in which some other systems allow a file to be tagged as being invisible. You can see and manipulate these files in most programs; they just don't get displayed in lists by default or matched by globs. This applies to all the components in a path, not just file names. Note that a period has no special meaning except as the first character of a file name, and even then the meaning is purely one of convention. UNIX file names may have as many (or as few) periods in them as they want. Some programs assign special meaning to suffixes starting with a period, but most UNIX programs give no special interpretation to the name of a file. The pattern *.name does not match a file named .name; the period in the pattern is not at the beginning of the pattern, so it can't match a period at the beginning of a file name.
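A quick experiment shows the dot file rule at work. As before, this is a sketch that assumes mktemp -d exists and that the shell has its default globbing options; the file and variable names are invented for the example.

```sh
#!/bin/sh
# Scratch directory; mktemp -d is assumed to be available.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1
: > visible
: > a.name
: > .name

all=$(echo *)           # dot files are skipped
suffix=$(echo *.name)   # .name is skipped: its period is the first character
dotted=$(echo .n*)      # a literal leading period in the pattern does match

cd / && rm -rf "$dir"

echo "*      -> $all"
echo "*.name -> $suffix"
echo ".n*    -> $dotted"
```

Only the last pattern, which begins with a literal period, finds the file .name.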

CASE SENSITIVITY IN PATHNAME EXPANSION

Systems differ in their handling of letters in different cases. On a traditional UNIX system, files named readme and README can exist in the same directory because the names of files are case-sensitive; that is to say, capital and lowercase letters are distinct. Other systems have used two other conventions. Some file systems (most notably, the traditional MS-DOS FAT16 file system) store all names without reference to case. This policy is often called case-insensitive. On these systems, not only are README and readme the same name, there is no way to know which of them was used to create a file.

Some systems, most notably the Macintosh and Amiga, introduced a new (well, it was new in the 80s, and UNIX doesn't change much) policy called case-preserving. On a case-preserving file system, the exact name used to create a file is preserved in the file system, but matches against file names are typically case-insensitive. Thus you can see that the file was named ReadMe when it was created, but if you try to open a file named rEADmE, you get the same file anyway. This behavior is also quite common on the more modern (well, relatively speaking) FAT32 file system used by Windows 95, and commonly used on flash drives or external hard drives. However, it is dependent on the "long name support" introduced in that era, and some devices (such as cameras) may fail spectacularly to recover gracefully if a file's name uses this feature.

For the most part, the UNIX shell is totally unaware of this, which can be a major source of surprises when using a case-preserving file system. The most common case-preserving file systems in use today are the native ones of Windows and Macintosh machines. Since OS X is a UNIX system these days, and many users expect shell scripts to run in the various UNIX-like emulation environments available under Windows, this may impact your scripts some day.

Some shells may offer extra options to provide for pathname expansion that ignores case. With shells that do not, you have to be aware of the potential issues. Even if the shell handles this well, though, utility programs may or may not do so reliably. Some programs may scan a directory looking for matching names before trying to open a file, end up failing to see the file, and possibly later overwriting it. This is unusual, but not unheard of. Your best allies in this are experienced users, who are typically familiar with the case handling of their system and reasonably careful about it.

A common pitfall for users coming from DOS environments is to think that the pattern *.* should match any file. However, this convention relies on the distinction between a file's name and the characters after the period, called the extension. UNIX has no such distinction, and a file whose name does not contain a literal period (.) does not match this pattern. This pattern also does not match dot files. It is not enough to match the period literally; the period must be the first character in the relevant path component to match against a dot file.

In some cases, pathname expansion will not detect files that can be accessed explicitly by name. There are three cases where this may apply. The first is case-sensitivity issues (see the previous sidebar). The second is that some network disk services provide directories only when they are explicitly requested; echo * lists only those directories that are in use, not the ones that could be in use if you asked for them.

Finally, globbing relies on the ability to read directories, while access to files relies only on the execute permission bit. This is a reasonably arcane distinction, which most people rarely encounter. Normally, directories give neither or both read and execute permission to any given user. However, it is possible to grant execute permission alone to a directory. This might be useful, for instance, in a public file server, allowing people to access files by name, but not to obtain a listing of files. Globbing requires the ability to read the directory to obtain the list of files against which a glob pattern is matched; without that, no file ever matches a glob.

Some shells offer an additional kind of pathname expansion called brace expansion. This is not portable to standard shells, but this does not mean you can safely ignore it; it means that, in some cases, file names with patterns like {a,b} will not behave as you expect them to. Brace expansion is discussed in Chapter 7. It does not affect file names expanded through pathname expansion, or the results of parameter expansion, so you do not need to worry about it when interacting with generated file names.

Using Globs

All of the previous discussion is pretty useful, but it can be a bit hard to get a feel for how to use globs without a few examples. This section introduces a few of the most common shell pattern idioms and explains how each of them works; it also gives some key advice about using globs effectively, both interactively and in scripts.

The pattern .??* matches any file beginning with a period and following it with at least two characters; this is used to match dot files in a given directory. This pattern is constructed to match files with names beginning with a period (.), but exclude the two special directory entries . and .. (which refer to the current and parent directory, respectively). You might think that, since the initial period has to be matched explicitly, you could use .?*, but the second period in .. is not special and can be matched by a question mark. This pattern does not catch files with names like .a or .b, which can be a problem.
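The gap left by .??* can be seen by counting matches in a scratch directory. The sketch below also tries a pair of patterns, .[!.]* and ..?*, that some people use together instead; that pair is not from this chapter, so treat it as an assumption to test in your own shell. The helper name count and the file names are hypothetical, and mktemp -d is assumed to be available.

```sh
#!/bin/sh
# Scratch directory; mktemp -d is assumed to be available.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1
: > .a
: > .profile
: > ..odd
: > plain

# count is a hypothetical helper: it reports how many words it was given.
count() { echo "$#"; }

long=$(count .??*)       # ..odd and .profile, but not .a
short=$(count .[!.]*)    # .a and .profile, but not ..odd
dots=$(count ..?*)       # ..odd only

cd / && rm -rf "$dir"

echo ".??*    matched $long names"
echo ".[!.]*  matched $short names"
echo "..?*    matched $dots names"
```

None of the three patterns matches the . or .. entries. Between them, the last two patterns cover all three dot files, including the .a that .??* misses.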

To match any file with a name ending in .png or .gif, use a pattern like *.[pg][ni][gf]. In fact, this pattern also matches a number of other possible names, but luckily the number of clashes is low. (This problem gets worse if you try to match many more file suffixes.) Patterns like this are useful in cases where you can think of two or three likely file name suffixes that might be in use, but you are not sure all of them will be in use. If you have a directory containing a number of PNG files (using the common suffix) but no GIF files, and use the pair of patterns *.png and *.gif, the second pattern matches no files, and is left untouched. By contrast, the pattern *.[pg][ni][gf] matches all the PNG files and is replaced by their names, even though there are no GIF files.
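When a script loops over separate patterns such as *.png and *.gif, the untouched pattern shows up as if it were a file name. One way to guard against that, sketched here with invented file names and assuming mktemp -d is available, is to check that each word names something that exists.

```sh
#!/bin/sh
# Scratch directory with PNG files but no GIF files.
# mktemp -d is assumed to be available.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1
: > a.png
: > b.png

leftover=$(echo *.gif)   # no match, so the pattern itself is printed

found=0
for f in *.png *.gif
do
  # An unmatched pattern is passed through literally; skip it.
  [ -e "$f" ] || continue
  found=$((found + 1))
  echo "image: $f"
done

cd / && rm -rf "$dir"

echo "unmatched pattern was left as: $leftover"
echo "$found image(s) found"
```

The loop sees three words (a.png, b.png, and the literal *.gif) but counts only the two that are real files.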

A similar technique is often used for case-insensitive file name matching; for instance, you might use *.[Tt][Xx][Tt] to match files with a .txt suffix. By convention, when using sets of character classes like this, you should use the same position in each class for a given component. Thus [pg][ni][gf] suggests png and gif to the reader; if you wrote [gp][ni][gf], people would think you were aiming for gng and pif.

Files with really long names often lend themselves to abbreviation using a wildcard expected to match only one file's name. This is probably one of the most common sources of crazy or unplanned behavior in interactive usage; be careful when picking the patterns you use! It is very easy to get thrown off by an * unexpectedly matching a very long string, or an empty string, when you were looking at a particular part of a path name. This can be done across multiple directories, as well; a Mac user might spell /System/Library/LaunchDaemons as /S*/L*/L*ons. Anchoring the first and last characters of a file name often narrows down the field very quickly.

Wildcards can also be used to avoid shell metacharacters without quoting; for instance, a file named a;b can be referred to as a?b, as long as there are no other files matching the pattern. The use of ? as a fill-in for spaces or other special shell characters is idiomatic.

EXERCISE CAUTION

Be careful with wildcards. Typos can create horrible problems. One of the most common typos I've seen (and made, repeatedly) is to try to remove .o files (created by the C compiler) and end up typing rm *>o. This removes every file (except dot files) in the current directory and redirects its output (which is usually empty) into a file named o. This typo may seem unusual, but the * is a shifted key on most US keyboards and so is >. Just remember: There is no undo button. Whenever you're about to type an rm command, especially an rm -f, be sure to check the command line out to make sure you haven't made any crucial typos. Do not alias rm to rm -i; this is a horrible habit, which breaks a lot of useful scripted features. Worse, it will make you careless. A poor-quality safety net is worse than no safety net at all.

Regular Expressions

A comprehensive review of regular expressions is too much to fit into a single chapter. Whole books have been written on the topic. This section provides a basic grounding in regular expressions, covering the main features of the most common varieties. Regular expressions are primarily used by programs other than the shell, although many shells have a built-in version of some command (typically expr) that uses them. However, they are not used in portable shell syntax. (Some shells offer relevant exceptions, discussed in Chapter 7.) The term regular expression is often abbreviated to either regexp or regex. While regexp is clearer to read, regex is pronounceable; the plural is regexes (or regexps, which is still unpronounceable). I use regex here for brevity.

There are two primary varieties of regexes: basic regexes (often called BREs) and extended regexes (EREs). Each uses slightly different rules. The basic regex syntax is actually slightly more powerful than the extended syntax, but it is harder to write clearly and concisely. Many implementations offer additional features bolted on to either of these, making it hard to be sure exactly which features are portable. What's worse, not everyone implements the official POSIX standard for regexes, so you cannot necessarily rely on the standard. The default in most tools is to provide basic regexes with at least a few extensions, which may or may not be documented.

In addition to the traditional forms of regexes, there are other variants. The Perl programming language introduced a number of additional features, which have become popular and widely used. Many programs other than Perl now provide "Perl-compatible regular expressions," thanks to the efforts of the kind people at www.pcre.org. There are other pattern matching languages available, such as Lua's patterns, some of which are much simpler than regexes.

In any discussion of regexes, credit must be given to Henry Spencer's regular expression library, released long enough ago that free software was a relatively new concept. Before POSIX even existed, Henry Spencer wrote an essentially compatible clone (not derived from AT&T source) of the V8 UNIX regexp() family of library functions. While most systems now provide standard library functions to make regexes available to most programs, this was not the case back then, and many programs offer regex support in the first place only because the Henry Spencer regex library made it possible. It offered what were essentially extended regexes (and still does in a few programs, I'm sure). This code was written in 1986 and is still found in a few modern systems in compatibility libraries.

Basic Regular Expressions

Regexes are most famously used by the grep utility; its name is derived from the ed editor's usage g/regular-expression/p, meaning "global search for regular-expression and print." In fact, there are often several varieties of the grep utility on a system, and it may support more than one variety of regex; this can be a portability problem if you depend on one of the extensions. Toward the end of the chapter, Table 2-8 shows the common variants you are likely to encounter and where they are likely to be found. As with most tools, check the documentation and any available standards; don't just test behavior on a given system. Some newer software now uses extended regexes by default, and behavior can vary surprisingly. However, the most common utilities (grep, expr, sed) default to basic regexes. Because of this, I start with basic regexes, then go on to a description of the differences between extended and basic regexes; it mostly boils down to putting a backslash in front of anything cool in a basic regex. This reverses the usual sense of backslash as suppressing special meanings.

Unlike shell patterns, regexes are considered to have matched if there exists a matching string anywhere in the string being matched, even if it does not fill the whole line; this is similar to the behavior of a shell pattern with a * on each end. You can override this by anchoring the regex, tying it to the beginning of the line with a leading ^ or to the end of the line with a trailing $. The shell pattern hello is equivalent to the regex ^hello$. In some cases, a regex is implicitly anchored; for instance, the expr utility's colon (:) operator matches a regex against the beginning of a string.
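To make the effect of anchoring concrete, here is a small sketch that counts matching lines with grep -c. The sample lines and variable names are invented for the example.

```shell
lines='hello
hello world
say hello'

# Unanchored: the regex may match anywhere in the line.
unanchored=$(printf '%s\n' "$lines" | grep -c 'hello')

# Anchored on the left only: the line must start with hello.
left_only=$(printf '%s\n' "$lines" | grep -c '^hello')

# Anchored on both ends: equivalent to the shell pattern hello.
both_ends=$(printf '%s\n' "$lines" | grep -c '^hello$')

echo "unanchored=$unanchored left_only=$left_only both_ends=$both_ends"
```

All three lines contain hello, two of them start with it, and only one consists of nothing else.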

In regexes, the character that matches anything is period (.), not question mark (?). So, if you want to match multiple greetings, you'd use h.llo as a regex, not h?llo. Character classes are essentially the same, except that regexes use ^, not !, to negate a character class. (Some shells support this syntax in character classes as well, as an extension.) Support for the POSIX [[:class:]] feature (and the related [=name=] and [.name.] features) is slightly more common in regex implementations than it is in shells, but it is still not portable enough to rely on.

You may have noticed that ^ has two different meanings in regexes. The regex ^[0-9] matches a digit at the beginning of a string; the regex [^0-9] matches any character but a digit anywhere in a string. Many seemingly intractable regex problems have turned out to be typos closely related to this.
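The following sketch illustrates both points with grep -c: the period as the match-anything character, and the two roles of ^. The sample words are invented for the example.

```shell
# . matches exactly one character, so hllo (with nothing between h and l)
# is not counted.
greetings=$(printf '%s\n' hello hallo hullo hllo | grep -c 'h.llo')

# ^ outside brackets anchors to the start of the line.
starts_with_digit=$(printf '%s\n' 7up up7 777 abc | grep -c '^[0-9]')

# ^ as the first character inside brackets negates the class.
has_non_digit=$(printf '%s\n' 7up up7 777 abc | grep -c '[^0-9]')

echo "$greetings $starts_with_digit $has_non_digit"
```

Moving the ^ one character to the right changes the count from the two lines that begin with a digit to the three lines that contain something other than a digit.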

Where regexes really begin to differ from shell patterns is in the handling of *. In shell patterns, the asterisk itself is capable of matching parts of a string. In a regex, it modifies the previous character. The regex apples* matches either apple or apples (or applesssss, for that matter). Instead of matching something in addition to the preceding s, the * modifies the s. The * is called a repetition operator; it repeats something else, rather than matching anything itself. If you want the behavior of a shell pattern *, it is spelled .* in regexes; that matches any number of any character. Note that the repetition operator repeats the previous matching construct; .* can match any number of different characters, not just the same character over and over.
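A short sketch of the difference, using anchored regexes so that each line must match in full. The word list is invented for the example.

```shell
words='apple
apples
applesssss
applet'

# s* means "zero or more s", so the first three words match in full.
# applet does not, because the t is left over before the $ anchor.
star_count=$(printf '%s\n' "$words" | grep -c '^apples*$')

# .* behaves like the shell's *: any run of any characters.
# Only applet starts with a and ends with t.
dotstar_count=$(printf '%s\n' "$words" | grep -c '^a.*t$')

echo "$star_count $dotstar_count"
```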

In fact, the * operator doesn't really operate on characters. It operates on indivisible chunks of regex, called atoms. A character is always an atom because there is no way to match just part of it. Another way to create an atom is to group things manually, using parentheses; in a basic regex, grouping parentheses are written with backslashes. Material between \( and \) is called a subexpression, and is matched as a single unit. For instance, the expression ba\(na\)* can match ba, bana, banana, or bananana, but it cannot match banan. The n and a have been grouped into an atom. Character classes and the period are also atoms. When an atom is repeated, it is possible for it to match a different thing each time. The regex [aeiou]* can match any string of vowels; each repetition of the atom is checked separately.
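The banana example can be tried directly. In this sketch the regex is anchored at both ends so that the whole word has to match; in basic regex syntax the grouping parentheses are written \( and \). The variable name matched is invented for the example.

```shell
matched=""
for word in ba bana banana banan
do
  # The subexpression \(na\) is a single atom, so * repeats "na" as a unit.
  if printf '%s\n' "$word" | grep '^ba\(na\)*$' >/dev/null; then
    matched="$matched $word"
  fi
done
echo "matched:$matched"
```

Without the anchors, banan would also be reported, because it contains bana.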

The same rules that allow a subexpression to join multiple characters into an atom allow multiple subexpressions to be joined; subexpressions can be nested. Good examples of nested subexpressions are rare in basic regexes; the best uses for them rely on additional operators not provided in historic implementations of basic regexes.

The more general repetition operator is \{x,y\}, indicating a repetition of between x and y copies of the preceding atom; if y is omitted, leaving only \{x,\}, any number of copies greater than or equal to x is matched. If the comma is also omitted, exactly x copies are matched. Thus \{x\} is precisely equivalent to \{x,x\}.
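As an illustration, the sketch below counts how many of four sample lines match each form of the operator, written with the backslashes that basic regexes require. The sample data is invented.

```shell
sample='a
aa
aaa
aaaa'

exactly_two=$(printf '%s\n' "$sample" | grep -c '^a\{2\}$')
two_or_more=$(printf '%s\n' "$sample" | grep -c '^a\{2,\}$')
two_to_three=$(printf '%s\n' "$sample" | grep -c '^a\{2,3\}$')

echo "$exactly_two $two_or_more $two_to_three"
```

The anchors matter here: without them, a\{2\} would also match inside aaa and aaaa.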

The majority of what you need to know to write basic regexes can be summed up with a list of atoms and a list of repetition operators, as shown in Tables 2-3 and 2-4.

Table 2-3. Basic Regular Expression Atoms

Atom	Description
`.`	Match any character
`[...]`	Character class
`\(...\)`	Subexpression
Anything else	Individual characters are atoms

So, for instance, in the regex ab*, there are two atoms (a and b), and the repetition operator * modifies the second atom. In \(ab*\)c, there is a subexpression consisting of two atoms and a repetition operator, and the whole subexpression is itself an atom. Repetition operators are not atoms; they operate on atoms. An atom followed by a repetition operator is not an atom anymore. If you want to make an atom containing a repetition operator, you must wrap it in parentheses to create a subexpression.

Table 2-4. Repetition Operators in Basic Regular Expressions

Operator	Meaning
`*`	Zero or more
`\{` x `\}`	Exactly x
`\{` x, `\}`	At least x
`\{` x, y `\}`	Between x and y, inclusive

Backreferences

There is one other thing, which is neither an atom nor a repetition operator. In a basic regex, a backslash followed by a single digit is a special construct called a backreference. As the name suggests, a backreference is a reference to something earlier in the regex. When a group is parenthesized, it becomes a subexpression. The backreference \1 refers to the first subexpression. Unlike a repetition operator, a backreference refers to the matching string rather than the matching expression. So .\{2\} matches any two characters, but \(.\)\1 matches only two of the same character. Backreferences are extremely powerful, but they come with some awkward edge conditions.
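The difference between repeating an expression and repeating a matched string can be seen in a small grep sketch. In basic regex syntax the backreference is written \1 and the subexpression \(.\); the word list is invented for the example.

```shell
words='book
bark
llama
abab'

# .\{2\} repeats the expression: any two characters will do.
any_two=$(printf '%s\n' "$words" | grep -c '.\{2\}')

# \(.\)\1 repeats the matched text: the same character twice in a row.
doubled=$(printf '%s\n' "$words" | grep -c '\(.\)\1')

# A longer subexpression works the same way: "ab" followed by "ab".
repeated_pair=$(printf '%s\n' "$words" | grep -c '\(ab\)\1')

echo "$any_two $doubled $repeated_pair"
```

Every word has two characters somewhere, but only book and llama contain a doubled letter, and only abab contains ab twice in a row.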

Backreferences are counted by open parentheses, not closed parentheses; given the expression \(\(ab\)*c\)*, \1 refers to the outer subexpression and \2 to the inner subexpression. It is not at all clear what should happen if you write \(\(b\)*\2\), and use of nested subexpressions and backreferences within subexpressions is probably not safe or portable.

Using backreferences is a bit tricky. Very few regexes really need backreferences; in fact, they are omitted in extended regexes (though some implementations offer them as an extension). Even worse, their performance can be incredibly bad; a carefully crafted regex with many subexpressions and backreferences can take seconds or even minutes to match against a string, even on ludicrously fast modern hardware.

Extended Regular Expressions

Extended regexes (often called EREs) are much more powerful than basic regexes in some ways, but weaker in others. They are most prominently associated with the egrep utility. One of the most obvious differences is the simplification of syntax; parentheses used for grouping, and braces used for repetition, do not need backslashes in extended regexes. There are several possible ways to get a literal open brace, but the only portable one is [{]. (More on this in the "Common Extensions" section.)

Extended regexes offer two additional repetition operators, ? and +. The ? operator is equivalent to {0,1}, and the + operator is equivalent to {1,}. Both offer greatly improved readability, even though they do not offer new functionality.
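The equivalences can be checked with grep -E, which is assumed here to select extended regexes (egrep is the traditional spelling). The sample lines are invented.

```shell
sample='ac
abc
abbc
abbbc'

# ? and {0,1} both mean "zero or one b".
optional_short=$(printf '%s\n' "$sample" | grep -E -c '^ab?c$')
optional_long=$(printf '%s\n' "$sample" | grep -E -c '^ab{0,1}c$')

# + and {1,} both mean "one or more b".
plus_short=$(printf '%s\n' "$sample" | grep -E -c '^ab+c$')
plus_long=$(printf '%s\n' "$sample" | grep -E -c '^ab{1,}c$')

echo "$optional_short $optional_long $plus_short $plus_long"
```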

One of the most significant enhancements of extended regexes is the alternation operator (|). This is usually pronounced "or," not "pipe," because it is the symbol used for logical or bitwise or operations in some languages. In an extended regex, a|b matches either a or b. This operator has a low precedence (lower than the joining of adjacent atoms), so hello|goodbye matches either hello or goodbye, not hellooodbye or hellgoodbye. Furthermore, it applies to atoms including subexpressions, which combines with nested subexpressions to make for a number of interesting patterns. The extended regex ((0[1-9])|(1[0-2]))? matches any two-digit number from 01 to 12, or an empty string. Patterns like this can be used to check for somewhat more structured data than can easily be checked for with basic regexes.
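A sketch of a two-digit month check built on this kind of alternation follows. The function name is_month is invented; the second alternative is written 1[0-2] so that 10 is accepted along with 11 and 12, the trailing ? is dropped so that an empty string is rejected, and grep -E is assumed to be available.

```shell
# is_month is a hypothetical helper: succeeds for 01 through 12 only.
is_month() {
  printf '%s\n' "$1" | grep -E '^((0[1-9])|(1[0-2]))$' >/dev/null
}

for candidate in 01 09 10 12 13 00 7
do
  if is_month "$candidate"; then
    echo "$candidate: valid month"
  else
    echo "$candidate: not a month"
  fi
done
```

The outer parentheses keep the anchors applied to both alternatives; without them, ^ would bind only to the first alternative and $ only to the second.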

Extended regexes do not have backreferences (although many implementations offer them as an extension). They do have subexpressions, though. See Table 2-5 for the list of ERE atoms.

Table 2-5. Extended Regular Expression Atoms

Atom	Description
`.`	Match any character
`[...]`	Character class
`(...)`	Subexpression
Anything else	Individual characters are atoms

The repetition operators are similar, although there are more of them, as shown in Table 2-6.

Table 2-6. Repetition Operators in Extended Regular Expressions

Operator	Meaning
`*`	Zero or more
`?`	Zero or one
`+`	One or more
`{` x `}`	Exactly x
`{` x, `}`	At least x
`{` x, y `}`	Between x and y, inclusive

The interaction between the alternation operator and other components can be a bit confusing; even experienced programmers sometimes forget how it works. Table 2-7 illustrates how to use it.

Table 2-7. Alternation and Atoms

Expression	Meaning
`a|b`	a or b
`good|bad`	good or bad
`c|hat`	c or hat
`(c|h)at`	cat or hat
`a|b{2}`	a or bb
`(a|b)c`	ac or bc
`(a)|(b)c`	a or bc
`(a|b){2}`	aa, ab, ba, or bb

The case in which I have most often gotten confused with alternation is the difference between (expr1)|(expr2) and (expr1|expr2). These are, in fact, completely interchangeable, as long as you are not going to refer back to the subexpression later and as long as you don't have any other text in your pattern. If there is other text, though, they are different. Consider the following example:

(h[eu]llo)|(good(bye| night)) (world|moon)

It is pretty obvious what this is doing; it's matching any of four statements ("hello" or "hullo" or "goodbye" or "good night"), followed by either "world" or "moon." Unfortunately, while this is obvious, it is also wrong. In fact, it can match either "hello" or "hullo" with nothing following them. The | between the hello and goodbye subexpressions is dividing the whole expression; the space before (world|moon) is not special in any way in a regex, so it just continues extending the subpattern on the right side of the |. In terms of Table 2-7, this is actually (a)|(b)c, not (a|b)c.
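The difference is easy to see by testing both groupings. In this sketch, loose is the expression as written above and tight adds one more pair of parentheses around the greetings. The helper name check is invented, and grep -E -x (extended regexes, whole-line match) is assumed to be available.

```shell
loose='(h[eu]llo)|(good(bye| night)) (world|moon)'
tight='((h[eu]llo)|(good(bye| night))) (world|moon)'

# check REGEX STRING: prints yes if the whole string matches, else no.
check() {
  if printf '%s\n' "$2" | grep -E -x "$1" >/dev/null; then
    echo yes
  else
    echo no
  fi
}

echo "loose, hello:            $(check "$loose" 'hello')"
echo "loose, hello world:      $(check "$loose" 'hello world')"
echo "loose, goodbye moon:     $(check "$loose" 'goodbye moon')"
echo "tight, hello:            $(check "$tight" 'hello')"
echo "tight, hello world:      $(check "$tight" 'hello world')"
echo "tight, good night moon:  $(check "$tight" 'good night moon')"
```

With whole-line matching, the loose form accepts a bare hello and rejects hello world, which is the opposite of what was intended; the tight form behaves as the description promised.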

Common Extensions

A number of extensions to both basic and extended regexes are quite common. Many implementations of basic regexes allow \? and \+ as synonyms for the extended regex ? and + repetition operators. Some also allow alternation using \|. Similarly, some implementations of extended regexes support backreferences. Another very popular extension is the special pseudo-anchors \< and \>, which match the beginning and end of a word; these may be found in both basic and extended regex implementations. Some systems spell these instead as [[:<:]] and [[:>:]]. Historical egrep did not support \{ as a literal open brace, but many modern implementations do. The POSIX standard specifies that a { not followed by a digit is also literal, but do not rely on this; even if computers always understood it, programmers would not.

Most modern systems tend to offer a sort of hybrid mode in which extended regexes support backreferences, and basic regexes support at least a few of the extended regex operators. On some systems, a plain ? may work even in an alleged basic regex. Text editors that support regexes are particularly likely to offer strange hybrid feature sets.

In terms of portability, nearly every system has some programs that support extended regexes, but many programs provide BREs by default, or exclusively, for compatibility reasons. Table 2-8 lists a few of the most common programs that support regexes of one variety or another.

Table 2-8. Regular Expression Support

Program	Regex Type	Notes
`awk`	Extended	Also true of `awk` variants, such as `gawk` or `mawk`
`emacs`	Basic	Also supports `?` and `+` (without backslashes) and `\|` as a synonym for ERE `|`
`expr`	Basic	Some versions may offer `\?`.
`sed`	Basic	Very few versions support `\?`.
`grep`	Basic	See also `egrep`.
`egrep`	Extended	Most commonly known variant; also known as `grep -E` on some systems.
`fgrep`	N/A	Does not actually use regexes; matches fixed strings only.
`vi`	Basic	`nvi` has an option to switch to extended REs; `vim` supports `\?` and `\+`.

Replacements

As has been previously pointed out, patterns are usually implicitly anchored to the ends of a string; to match a pattern anywhere in a string, you must write *pattern*. Regexes, by contrast, are not usually anchored. There is a particularly important reason for this; it is often desirable to be able to replace the matching text with something else. The most common place this is encountered in scripting is in sed's s/pattern/replacement/ operator. This finds any chunk of a string matching pattern and replaces it with replacement. If the pattern were implicitly anchored and had to start and end with .* to match text in the middle of a string, replacements would always replace the whole string. This is not usually what you want.

In general, replacement text allows some reference back to the matched string. There are two ways to do this; one is by using \N to refer to subexpressions, much like a backreference. The other is to use & (or \& in a few programs) to refer to the entire matched string. The sed substitution operator allows repeated matches, each starting from immediately past the previous match, with the g suffix; s/./&-/g replaces word with w-o-r-d-.
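Both kinds of reference can be tried with sed. The sketch below runs the w-o-r-d- example and adds a subexpression swap using \1 and \2; the second input string is invented for the example.

```shell
# & stands for the whole match; g repeats the substitution along the line.
dashed=$(echo word | sed 's/./&-/g')
echo "$dashed"

# \1 and \2 stand for the first and second subexpressions.
swapped=$(echo 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/')
echo "$swapped"
```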

Elaborate replacement strings using subexpressions are one of the places where the simpler syntax of extended regexes is the most rewarding. It is fairly tedious to type a pattern with multiple subexpressions. Consider this simple pattern for replacing Random, John Q. with John Q. Random:

s/\([^ ]\{1,\}\), \([^ ]\{1,\}\) \([^ ]\{1,\}\)/\2 \3 \1/

s/([^ ]+), ([^ ]+) ([^ ]+)/\2 \3 \1/

The extended regex is quite a bit shorter and easier to read. Note that while extended regexes may not support backreferences, replacements using extended regexes typically support references to subexpressions.
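Here is a sketch that runs both forms of the name-swapping substitution through sed. The basic form uses \( \) and \{1,\}; the extended form relies on sed -E, which many current sed implementations accept but which is an assumption here rather than something every system provides.

```shell
name='Random, John Q.'

# Basic regex: grouping and repetition need backslashes.
bre=$(printf '%s\n' "$name" |
  sed 's/\([^ ]\{1,\}\), \([^ ]\{1,\}\) \([^ ]\{1,\}\)/\2 \3 \1/')

# Extended regex: assumes this sed understands -E.
ere=$(printf '%s\n' "$name" |
  sed -E 's/([^ ]+), ([^ ]+) ([^ ]+)/\2 \3 \1/')

echo "$bre"
echo "$ere"
```

In both cases the replacement side is written the same way; only the pattern side changes between the two syntaxes.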

Using Regular Expressions

Regular expressions are mostly found in external utilities (although some shells may implement expr as a built in for performance reasons). Because of this, in cases where you can use a shell pattern instead of a regex, it may be more efficient to use the shell's built-in pattern matching, such as the case statement, instead of using an external utility. When using POSIX shells, the pattern-matching parameter substitutions (discussed in Chapter 7) make it even easier to get a lot done without needing regexes.

The expr utility offers a fairly flexible regex feature; expr string : pattern performs a regex match of string against pattern. In this case, the regex is implicitly anchored to the beginning of the string, as though it had a leading ^; to bypass this, start your pattern with .*. The value produced by expr depends on whether pattern has subexpressions. If there is at least one parenthesized subexpression, expr prints the contents of \1, or an empty string if there is no match. Otherwise, expr prints the length of the match, or 0 if there is no match:

$ expr foobar : foo

3

$ expr foobar : '\(foo\)'

foo

Unlike grep, expr does not consider a zero-length match to be a success; to grep (and most editors), the pattern b* matches the word hello because the word hello contains zero or more repetitions of the letter b. To expr, only a match of at least one character is a real match.
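The contrast can be checked through exit status, as in this small sketch. The variable names are invented for the example.

```shell
# grep: b* matches the empty string at the start of hello, so this succeeds.
if printf '%s\n' hello | grep 'b*' >/dev/null; then
  grep_result="match"
else
  grep_result="no match"
fi

# expr: the match has length 0, so expr prints 0 and reports failure.
if expr hello : 'b*' >/dev/null; then
  expr_result="match"
else
  expr_result="no match"
fi

echo "grep: $grep_result"
echo "expr: $expr_result"
```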

One use of the expr utility is extracting parts of file names. A pair of common utilities, basename and dirname, allow you to extract part of the name of a file from its path. These utilities are not completely portable, but you can do the same thing with expr:

$ expr /path/to/file : '\(.*\)/[^/]*'

/path/to

$ expr /path/to/file : '.*/\([^/]*\)'

file

Each of these expressions matches the same string; an arbitrarily long string of any characters whatsoever, followed by a slash and then any string of characters other than slashes. The difference is in which part of this pattern is marked as a subexpression; in the first pattern, it is the material before the slash, and in the second, it is the material after the slash. One weakness of expr is that you can only use it to extract the first subexpression of a regex. If you need to use a subexpression for grouping before the material you want, you will have to do something more elaborate to extract the desired text. However, in the most common cases, you can get what you want.

The preceding example assumes there is always a slash in the expression. What if there isn't?

$ expr filename : '\(.*\)/[^/]*'



$ expr filename : '.*/\([^/]*\)'

The expression doesn't match because there's no slash. So, of course, the thing to do is make the slash optional:

$ expr filename : '\(.*\)/\{0,1\}[^/]*'

filename

$ expr filename : '.*/\{0,1\}\([^/]*\)'

This doesn't work either. The second result might surprise you, but with the slash made optional, the .* on the left end of the expression can match the whole string; there is nothing to force it to leave any characters for the subexpression on the right to consume. In practice, you have to use another layer of testing to determine whether there is a slash before trying to split the string around it. (More advanced pattern-matching tools, such as the pcre library, could do this in one pass.)
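One possible shape for that extra layer of testing is sketched below: a case statement checks for a slash first, and expr is used only when one is present. The function names path_dir and path_base are invented, and returning . for a name with no directory part is a choice made for this example, not something the text prescribes.

```shell
# path_dir: print everything before the last slash, or . if there is none.
path_dir() {
  case $1 in
    */*) expr "$1" : '\(.*\)/[^/]*' ;;
    *)   echo . ;;
  esac
}

# path_base: print everything after the last slash, or the whole name.
path_base() {
  case $1 in
    */*) expr "$1" : '.*/\([^/]*\)' ;;
    *)   printf '%s\n' "$1" ;;
  esac
}

path_dir /path/to/file
path_base /path/to/file
path_dir filename
path_base filename
```

This sketch still has rough edges: a path such as /file yields an empty directory part, and a string that expr could mistake for one of its own operators would need extra protection.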

Regexes are one of the most powerful tools of the UNIX system. With experience and practice, they become second nature; nothing is so maddening as a program where searching does not support regular expressions. The biggest problem users tend to have early on is confusing regexes with patterns; there seems to be no cure for this but practice and habit. In general, patterns are used only in the shell and in file name matching; everything else uses regexes. The equivalences are simple enough, and anything complicated in a regex generally cannot be done with a shell pattern to begin with. The hard part is getting the habit for which one to use when.

Something that might help you develop a feel for the differences between patterns and regexes is to run some tests and experiment. The following script shows how different strings do, or do not, match against patterns and regexes. (An explanation of how this script works will have to wait for a couple of chapters.)

#!/bin/sh

pattern="$1"

shift

for string

do

  if expr "$string" : ".*$pattern" >/dev/null 2>&1; then

    echo "regex: $string matched $pattern."

  else

    echo "regex: $string didn't match $pattern."

  fi

  case $string in

  $pattern) echo "shell: $string matched $pattern.";;

  *) echo "shell: $string didn't match $pattern.";;

  esac

done

To use this script, save it in a file and mark it as executable (chmod +x filename). Run it with at least two arguments; the first is a pattern you wish to test, and the second and later arguments are strings you wish to see matched against the pattern. Here's a sample:

$ ./patcheck '*' aardvark

regex: aardvark didn't match *.

shell: aardvark matched *.

Be aware that this script does not try to anchor regexes for you, and it even suppresses the default anchoring on the left provided by expr. If you want to compare only against anchored regexes, change the expr line to read as follows:

if expr "$string" : "$pattern$" >/dev/null 2>&1; then

Regexes offer a number of improvements over shell patterns. The repetition operators allow for much more specific tests for common patterns, such as a string of unknown length containing only digits; the regex [0-9]* simply can't be expressed correctly in shell patterns. You can, however, use the pattern *[!0-9]* to detect any string that does not contain only digits.
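A sketch of the negated-pattern trick follows. The function name is_all_digits is invented, and treating the empty string as a failure is a choice made for this example.

```shell
# is_all_digits: succeed only for a nonempty string made up of digits.
is_all_digits() {
  case $1 in
    '')       return 1 ;;   # empty: nothing to check
    *[!0-9]*) return 1 ;;   # at least one character that is not a digit
    *)        return 0 ;;   # no non-digit was found
  esac
}

for value in 12345 12a45 '' 007
do
  if is_all_digits "$value"; then
    echo "'$value' is all digits"
  else
    echo "'$value' is not all digits"
  fi
done
```

The test works by elimination: rather than describing a run of digits, which a shell pattern cannot do, it rejects anything containing a character outside the class.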

Many utilities default to basic regexes, but optionally accept extended regexes. For the most part, if you haven't got a specific reason to think otherwise, any given program probably uses basic regexes as a default, usually with some extensions. More tips on managing the diversity of utility behaviors may be found in Chapter 8.

Replacing Patterns with Regular Expressions

Mechanically, it's quite easy to replace a pattern with a comparable regular expression. What is not so easy is getting the shell to use regexes in these places. The following discussion assumes some familiarity with statements and control structures, which are explained in the following chapters; you can come back to it later if too much of it is unfamiliar.

The two primary uses of shell patterns are file name matching and case statements. Replacing globs with regexes is not always easy. In the simplest case, you can use ls and grep together to generate a list. If you want a list of all files whose names have only digits in them before a particular suffix, such as .txt, you can express this as follows:

$(ls | grep '^[0-9]*\.txt$')

The ls command, when running in a pipeline, lists each file name on a separate line by default; the grep command then shows only the lines matching the given regex. The $() construct (explained in Chapter 5; not portable to a few older shells) substitutes the output of this command, split into words. For files not necessarily in the current directory, this can be harder, and you may need to use the find command.
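The following sketch tries the ls and grep combination in a scratch directory. The file names are invented, mktemp -d is assumed to be available, and the period before txt is written as \. so that it matches only a literal period.

```shell
scratch=$(mktemp -d) || exit 1
for name in 1.txt 22.txt a1.txt 3.text 4xtxt
do
  : > "$scratch/$name"
done

# The unquoted $(...) is split into words, one per matching file name.
numbered=$(cd "$scratch" && echo $(ls | grep '^[0-9]*\.txt$'))
echo "$numbered"

rm -rf "$scratch"
```

The file 4xtxt is there to show why the backslash matters: with an unescaped period, the regex would accept it as well. Like any approach based on word splitting, this breaks down for file names containing spaces or newlines.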

The case statement is hard to replace idiomatically. My advice is to replace it with a series of if and elif statements. Because only one branch of a case statement can match, these tests should be chained together with elif, so that later tests run only when earlier ones fail:

if expr "$1" : "$2" >/dev/null 2>&1; then

  echo "$2"

elif expr "$1" : "$3" >/dev/null 2>&1; then

  echo "$3"

elif expr "$1" : "$4" >/dev/null 2>&1; then

  echo "$4"

elif expr "$1" : "$5" >/dev/null 2>&1; then

  echo "$5"

else

  echo "no match"

fi

Another option, which may be more expressive in some cases, is to use regexes (and substitution) to generate a new string that is more amenable to pattern matching. Imagine that you wished to check for each of four flags, as in the previous example:

matches=""

expr "$1" : "$2" > /dev/null 2>&1 && matches="2$matches"

expr "$1" : "$3" > /dev/null 2>&1 && matches="3$matches"

expr "$1" : "$4" > /dev/null 2>&1 && matches="4$matches"

expr "$1" : "$5" > /dev/null 2>&1 && matches="5$matches"



case $matches in

*2*) echo "$2";;

*3*) echo "$3";;

*4*) echo "$4";;

*5*) echo "$5";;

*) echo "no match";;

esac

While this structure separates the matching operation into two passes, it preserves the semantics of the case statement precisely. On the down side, it does require processing all four tests before evaluating any of them.

Common Pitfalls of Regular Expressions

The two most common problems with regexes are matching too much and matching too little. In particular, it is extremely easy to be surprised when a .* matches nothing, and you expected it to match something, or to be surprised when it matches everything.

Some time ago, I wrote a script in which I intended to reverse the first two words of a line:

sed -e 's/\([^ ]*\) \([^ ]*\)/\2 \1/'

This did exactly what I expected; it selected everything up to the first space, and the next block of nonspaces, and reversed them. But then I wanted it to keep doing this to additional pairs, so I modified it:

sed -e 's/\([^ ]*\) \([^ ]*\)/\2 \1/g'

This seemed to work, but then I tried it on another system, and it didn't seem to work at all. While a b became b a, a b c d became b ac d. (There was also a trailing space after this, which I did not initially notice.) In fact, the "buggy" system was correct. The first iteration matches a b. The second matches an empty string of nonspaces, a space, and the letter c, and reverses them. Because I "knew" that my intent in writing [^ ]* was to match the largest available series of nonspaces, I forgot that the regex takes the first match it can find, matching as much as it can from that starting point, not the longest match it can find no matter where it has to start to make that match. Interestingly, several systems had a bug that caused them to skip that first character in this circumstance and "correctly" do what I wanted. (The bug seems to have been an unusual edge condition.)
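One way to avoid the empty match, offered as a sketch rather than as the fix used in the original script, is to require at least one nonspace character on each side by writing \{1,\} in place of *. The first command below is the problematic version; as the story above indicates, its output has varied between systems, so none is shown for it.

```shell
# Problematic: [^ ]* can match an empty string, so the second and later
# substitutions may start at a space. Output is not asserted here.
echo 'a b c d' | sed -e 's/\([^ ]*\) \([^ ]*\)/\2 \1/g'

# Sketch of a fix: each word must contain at least one nonspace character,
# so a match can never begin at a space.
fixed=$(echo 'a b c d' | sed -e 's/\([^ ]\{1,\}\) \([^ ]\{1,\}\)/\2 \1/g')
echo "$fixed"
```

With the stricter pattern, the space between the first and second pairs is skipped over and copied through unchanged, and c d is swapped as a pair.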

Forgetting anchors or including extra anchors are both common mistakes made when trying to match something specific. Just during the time I've been working on this book, I've been bitten several times by the fact that expr anchors regexes implicitly to the beginning of the string.

When you have an expression that could be seen as matching a string in more than one way, the general rule is that the leftmost expressions are greedy first. So, if part of a string could go in either of two subexpressions, it will be in the leftmost one.
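A small sketch of the leftmost-is-greediest rule: the input contains two X characters, so either subexpression could in principle claim the middle part. The input string is invented for the example.

```shell
# The first .* takes as much as it can, so the split happens at the last X.
split=$(echo 'aXbXc' | sed 's/\(.*\)X\(.*\)/[\1][\2]/')
echo "$split"

# expr shows the same thing: \1 runs up to the last X, not the first.
left=$(expr aXbXc : '\(.*\)X')
echo "$left"
```

To split at the first X instead, the left subexpression has to be prevented from matching an X at all, for instance by writing it as [^X]*.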

The distinctions between basic and extended regexes are another common source of confusion. If you have been using one heavily, and you switch to the other, all sorts of things go wrong. Subexpressions become literal parentheses, and vice versa; both are confusing. There is no such thing as a nontrivial regex that can be used both as a basic and an extended regex. If you have two editors, one that uses each syntax, expect to spend a lot of time puzzling over warnings about invalid repetition operators and unmatched parentheses, or wondering why a search didn't turn something up that is right there in the page.

What's Next?

The ability to decide which of several pieces of code to execute, or to execute code repeatedly, is essential to programming. Chapter 3 introduces the basic control structures that make the shell into a programming language rather than a mere macro language, as well as some of the tools the shell provides for the creation and manipulation of data files.
