Chapter 10

Regular Expressions

Chapter 9 focused on Linux. Chapter 10 continues with a Linux-related topic: regular expressions. Regular expressions provide a powerful tool for Linux users and administrators. With regular expressions, a user can search through text files not for specific strings but for strings that fit a particular pattern of interest. Linux offers the grep program, which performs such a search given a regular expression. In this chapter, regular expressions are introduced along with numerous examples and an examination of grep. Regular expressions are challenging to learn, and mastery comes only with extensive experience, so this can be a difficult chapter to read and understand. It is recommended that the reader try out many of these examples. The chapter also examines the use of Bash wildcards.

The learning objectives of this chapter are to

  • Describe regular expressions and why they are useful.
  • Illustrate the use of each regular expression metacharacter.
  • Provide numerous examples of regular expressions.
  • Examine the grep program.
  • Describe the use of wildcards in Bash and show how they differ from regular expressions.
  • Combine ls and grep through redirection.

Consider a string of characters that contains only 1s followed by 0s, for instance, 111000, 100, and 10000. A regular expression can be used to specify such a pattern. Once written, a regular expression can be compared against a collection of strings to find those that match the pattern. A regular expression is a string that combines literal characters (such as 0 or 1) with metacharacters, symbols that represent options. With metacharacters, you can specify, for instance, that a given character or set of characters should match “any number of” times or “at least one” time, or specify a list of characters so that “any one character” matches.

Regular expressions can be highly useful to either a user or system administrator when it comes to searching for files or items stored in files. In this chapter, we examine how to define regular expressions and how to use them in Linux. The regular expression is considered so useful that Linux has a built-in program called grep (global regular expression print), which is an essential tool for Linux users. Wildcards are available in Linux as well. They resemble regular expressions, but Bash interprets some of the same symbols differently from how a regular expression does. Regular expressions have also been built into some programming languages used extensively in Linux, such as Perl.

Let us consider two simple examples to motivate why we want to explore regular expressions. First, you, as a user, have access to a directory of images. Among the images are jpg, gif, png, and tiff formatted images. You want to list all of those under the tiff format. However, you are unsure whether other users will have named the files with a .tiff, .tif, .TIFF, .TIF, .Tiff, or .Tif extension. Rather than writing six different ls statements, or even one ls statement that lists each possible extension, you can use a regular expression. Second, as a system administrator, you need to search a directory (say /etc) for all files that contain IP addresses as you are looking to change some hardcoded IP addresses, but you do not remember which files to examine. A regular expression can be defined to match strings of the form #.#.#.#, where each # is a value between 0 and 255. In creating such a regular expression and using grep, you can see all of the matches using one command rather than having to examine dozens or hundreds of files.

In this chapter, we first examine the metacharacters for regular expressions. We look at dozens of examples of regular expressions and what they might match against. Then, we look at how some of the characters are used as wildcards by the Bash interpreter. This can lead to confusion because * has a different meaning when used as a regular expression in a program such as grep versus how the Bash interpreter uses it in an instruction such as ls. Finally, we look at the grep program and how to use it. Regular expressions can be a challenge to apply correctly. Although in many cases, their meaning may be apparent, they can often confound users who are not familiar with them. Have patience when using them and eventually you might even enjoy them.

Metacharacters

There is a set of characters that people use to describe options in a pattern. These are known as metacharacters. Any regular expression will comprise literal characters and metacharacters (although a regular expression does not require metacharacters). The metacharacter * means “match the preceding character 0 or more times”; so, for instance, a* means “zero or more a’s”. The regular expression 1010 matches only 1010 as it has no metacharacters. Since we will usually want our regular expressions to match more than one specific string, we will almost always use metacharacters. Table 10.1 provides the list of metacharacters. We will explore each of these in turn as we continue in this section.

Table 10.1

Regular Expression Metacharacters

Metacharacter    Explanation
*                Match the preceding character if it appears 0 or more times
+                Match the preceding character if it appears 1 or more times
?                Match the preceding character if it appears 0 or 1 time
.                Match any one character
^                Match if this expression begins a string
$                Match if this expression ends a string
[chars]          Match if the expression contains any of the chars in []
[chari-charj]    Match if the expression contains any characters in the range from chari to charj (e.g., a-z, 0-9)
[[:class:]]      An alternative form of [] where the :class: can be one of several categories such as alpha (alphabetic), digit, alnum (alphabetic or numeric), punct, space, upper, lower
[^chars]         Match if the expression does not contain any of the chars in []
\                The next character should be interpreted literally, used to escape the meaning of a metacharacter, for instance \$ means “match a $”
{n}              Match if the string contains n consecutive occurrences of the preceding character
{n,m}            Match if the string contains between n and m consecutive occurrences of the preceding character
{n,}             Match if the string contains at least n consecutive occurrences of the preceding character
{,m}             Match if the string contains no more than m consecutive occurrences of the preceding character
|                Match any of these strings (an “OR”)
(…)              The items in … are treated as a group, match the entire sequence

We will start with the most basic of the symbols: * and +. To use either of these, first specify a character to match against. Then place the metacharacter * or + after the character to indicate that we expect to see that character 0 or more times (*) or 1 or more times (+).

For instance, 0*1* matches any string of zero or more 0s followed by zero or more 1s. This regular expression would match against these strings: 01, 000111111, 1, 00000, 0000000001, and the empty string. The empty string is a string of no characters. This expression matches the empty string because the * can be used for 0 matches, so 0*1* matches a string of no 0s and no 1s. This example regular expression would not match any of the following: 10, 00001110, 0001112, 00a000, or abc. In the first and second cases, a 0 follows a 1. In the other three cases, there are characters in the string other than 0 and 1.

The regular expression 0+1+ specifies that there must be at least one 0 and one 1. Thus, this regular expression would not match the empty string; neither would it match any string that does not contain one of the two digits (e.g., it would not match 0000 or 1). This expression would match 01, 00011111, and 000000001. Like 0*1*, it would not match a string that had characters other than 0 or 1 (e.g., 0001112), nor would it match a string in which a 0 followed a 1 (e.g., 00001110).

We can, of course, combine the use of * and + in a regular expression, as in 0*1+ or 0+1*. We can also specify literal characters without the * or +. For instance, 01* will match against a 0 followed by zero or more 1s—so, for instance, 0, 01, 01111, but not 1, 111, 1110, or 01a. Although * and + are the easiest to understand, their usefulness is limited when just specified after a character. We will find that * and + are more useful when we can combine them with [] to indicate a combination of repeated characters.
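The examples so far can be checked with a short sketch. It uses Python's re module as a stand-in for grep (an assumption of this sketch, not something the chapter prescribes), and re.fullmatch so that the pattern must account for the entire string, which is how the examples above are read. The helper name matches_whole is hypothetical.

```python
import re

def matches_whole(pattern, text):
    """Return True when the pattern accounts for the entire string."""
    return re.fullmatch(pattern, text) is not None

samples = ["01", "000111111", "1", "00000", "", "10", "00001110", "0001112", "abc"]

# 0*1*: zero or more 0s followed by zero or more 1s
star_hits = [s for s in samples if matches_whole(r"0*1*", s)]

# 0+1+: at least one 0 followed by at least one 1
plus_hits = [s for s in samples if matches_whole(r"0+1+", s)]

# 01*: a literal 0 followed by zero or more 1s
literal_hits = [s for s in ["0", "01", "01111", "1", "111", "1110", "01a"]
                if matches_whole(r"01*", s)]

print(star_hits)
print(plus_hits)
print(literal_hits)
```

Only 0*1* accepts the empty string, and none of the three patterns accepts a string in which a 0 follows a 1.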

The ? is a variant, like * or +, but in the case of ?, it will only match the preceding character 0 or 1 time. This allows you to specify a situation where a character might or might not be expected. It does not, however, match repeatedly (for that, you would use * or +). Recall the .tiff/tif example. We could specify a regular expression to match either tiff or tif as follows: tiff?. In this case, the first three characters, tif, are expected literally. However, the last character, f, may appear 0 or 1 time. Although this regular expression does not satisfy strings such as TIFF (i.e., all upper-case letters), it is a start. Now, with ?, *, and +, we can control how often we expect to see a character, 0 or 1 time, 0 or more times, 1 or more times.

Unlike * and +, the ? places a limit on the number of times we expect to see a character. Therefore, with ?, we could actually enumerate all of the combinations that we expect to match against. For instance, 0?1? would match against only four possible strings: 0, 1, 01, and the empty string. In the case of 0*1*, there are an infinite number of strings that could match since “0 or more” has no upper limit.
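Because ? bounds the number of repetitions, the matching strings can be listed exhaustively. The following sketch does so with Python's re module (used here in place of grep, as an assumption of the example), testing every string of 0s and 1s up to length 3.

```python
import re
from itertools import product

# Every string of 0s and 1s up to length 3, including the empty string.
candidates = ["".join(p) for n in range(4) for p in product("01", repeat=n)]

# 0?1? allows at most one 0 followed by at most one 1.
optional_hits = [s for s in candidates if re.fullmatch(r"0?1?", s)]
print(optional_hits)

# tiff? requires t, i, f and then an optional second f.
tiff_hits = [s for s in ["tif", "tiff", "tifff", "TIFF"] if re.fullmatch(r"tiff?", s)]
print(tiff_hits)
```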

Note that both the * and ? are used in Linux commands like ls as wildcards. In Bash and Wildcards, we will learn that in such commands, their meaning differs from the meanings presented here.

The . (period) can be used to match any single character. For instance, b.t could match any of these strings bat, bet, bit, but, bot, bbt, b2t. It would not match bt, boot, or b123t. The . metacharacter can be combined with *, +, and ?. For instance, b.*t will match any string that starts with a b, is followed by any number of other characters (including no characters) and ending with t. So, b.*t matches bat, bet, bit, but, bot, bbt, b2t, bt, boot, b123t, and so forth. The expression b.+t is the same except that there must be at least one character between the b and the t, so it would match all of the same strings except for bt. The regular expression b.?t would match bt or anything that b.t matches. The question mark applies to the . (period). Therefore, . is applied 0 or 1 time; so this gives us a regular expression to match either bt or b.t. It would not match any string that contains more than one character between the b and the t.
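The four period-based patterns can be compared side by side. This is a minimal sketch in Python's re module (a stand-in for grep, assumed for illustration), again requiring the pattern to cover the whole word.

```python
import re

words = ["bat", "bet", "b2t", "bt", "boot", "b123t"]
patterns = [r"b.t", r"b.*t", r"b.+t", r"b.?t"]

# Map each pattern to the words it matches in full.
results = {p: [w for w in words if re.fullmatch(p, w)] for p in patterns}
for p in patterns:
    print(p, results[p])
```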

The next metacharacter is used to specify a collection or a list. It starts with [, contains a list, and ends with ]. Examples are [aeiou] and [123]. The idea is that such a pattern will match any string that contains any one of the items in the list. We could, for instance, specify b[aeiou]t, which would match any of bat, bet, bit, bot, and but. Or, we could use the [] to indicate upper versus lower case spelling. For instance, [tT] would match either a lower case t or an upper case T.

Now we have the tools needed to match any form of tif/tiff. The following regular expression will match any form of tif or tiff using any combination of lower- and upper-case letters: [tT][iI][fF][fF]?. The ? only applies to the fourth [] list. Thus, it will match either an upper- or lower-case t, followed by an upper- or lower-case i, followed by an upper- or lower-case f, followed by zero or one lower- or upper-case f.
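As a quick check of this expression, the sketch below tries it against the six extensions from the motivating example plus a few others. It uses Python's re module rather than grep, which is an assumption of the example; the constant name TIFF_EXT is hypothetical.

```python
import re

TIFF_EXT = r"[tT][iI][fF][fF]?"
extensions = ["tiff", "tif", "TIFF", "TIF", "Tiff", "Tif", "tIfF", "jpg", "gif", "tifff"]

# Keep only the extensions that the pattern matches in full.
tiff_like = [e for e in extensions if re.fullmatch(TIFF_EXT, e)]
print(tiff_like)
```

Mixed-case spellings such as tIfF are accepted as well, since each bracketed list is independent of the others.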

The list specified in the brackets does not have to be a completely enumerated list. It could instead be a range such as the letters a through g. A range is represented by the first character in the range, a hyphen (-), and the last character in the range. Permissible characters for ranges are digits, lower-case letters, and upper-case letters. For instance, [0-9] would mean any digit, whereas [a-g] would mean any lower-case letter from a to g. That is, [a-g] is equivalent to [abcdefg]. You can combine an enumerated list of characters and a range, for instance [b-df-hj-np-tv-z] would be the list of lower-case consonants.

As an alternative to an enumerated list or range, you can also use the double brackets and a class. For instance, instead of [a-zA-Z], you could use [[:alpha:]], which represents the class of alphabetic characters. There are 12 standard classes available in the Linux regular expression set, as shown in Table 10.2. A nonstandard class is [[:word:]], which consists of all of the alphanumeric characters plus the underscore.

Table 10.2

Regular Expression Classes

Class          Meaning
[[:alnum:]]    Alphanumeric: alphabetic character (letter) or digit
[[:alpha:]]    Alphabetic: letter (upper or lower case)
[[:blank:]]    Space or tab
[[:cntrl:]]    Any control character
[[:digit:]]    Digit
[[:graph:]]    Any visible character
[[:lower:]]    Lower-case letter
[[:print:]]    Any visible character plus the space
[[:punct:]]    Any punctuation character
[[:space:]]    Any whitespace (space, tab, newline, carriage return, vertical tab, form feed)
[[:upper:]]    Upper-case letter
[[:xdigit:]]   Hexadecimal digit

The list, as specified using [] or [[]], will match any single character if found in the string. If you wanted to match some combination of characters in a range, you could add *, +, or ? after the brackets. For instance, [a-z]+ means one or more lower-case letters.

Imagine that you wanted to match someone’s name. We do not know if the first letter of the person’s name will be capitalized but we expect all of the remaining letters to be in lower case. To match the first letter, we would use [A-Za-z]. That is, we expect a letter, whether upper or lower case. We then expect some number of lower-case letters, which would be [a-z]+. Our regular expression is then [A-Za-z][a-z]+. Should we use * instead of + for the lower-case letters? That depends on whether we expect someone’s name to be a single letter. Since we expect a name and not an initial, we usually would think that a name would be multiple letters. However, we could also use [A-Za-z][a-z]* if we think a name might be say J.

What would [A-Za-z0-9]* match? This expression will match zero or more instances of any letter or digit. This includes the empty string (as * includes zero occurrences), any single letter (upper or lower case) or digit, or any combination of letters and digits. So each of these would match:

abc ABC aBc a12 1B2 12c 123456789 aaaaaa 1a2b3C4D5e

So what would not match? Any string that contained characters other than the letters and digits. For instance, a_b_c, 123!, a b c (i.e., letters and blank spaces), and a1#2B%3c*fg45.
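The bracketed lists above can be exercised with a brief sketch in Python's re module (standing in for grep, as an assumption of the example). The constant names ALNUM, NAME_PLUS, and NAME_STAR are hypothetical.

```python
import re

ALNUM = r"[A-Za-z0-9]*"
accepted = "abc ABC aBc a12 1B2 12c 123456789 aaaaaa 1a2b3C4D5e".split()
rejected = ["a_b_c", "123!", "a b c", "a1#2B%3c*fg45"]

# Every string in the first list consists only of letters and digits.
all_accepted = all(re.fullmatch(ALNUM, s) for s in accepted)
# Every string in the second list contains some other character.
none_rejected = not any(re.fullmatch(ALNUM, s) for s in rejected)
print(all_accepted, none_rejected)

# A name: one letter of either case, then lower-case letters.
NAME_PLUS = r"[A-Za-z][a-z]+"   # at least two letters
NAME_STAR = r"[A-Za-z][a-z]*"   # also permits a single initial such as J
for name in ["Frank", "zappa", "J"]:
    print(name, bool(re.fullmatch(NAME_PLUS, name)), bool(re.fullmatch(NAME_STAR, name)))
```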

Notice with the [] that we can control what characters can match, but not the order that they should appear. If we, for instance, require that an ‘a’ precede a ‘b’, we would have to write them in sequence using two sets of brackets, such as [aA][bB] to indicate an upper- or lower-case ‘a’ followed by an upper- or lower-case ‘b’. We could also allow any number of them using [aA]+[bB]+ (or use * instead of +). This can become complicated if we want to enforce some combination followed by another combination. Consider that we want to create a regular expression to match any string of letters such that there is some consonant(s) followed by some vowel(s) followed by consonant(s). We could use the following regular expression:

[b-df-hj-np-tv-z]+[aeiou]+[b-df-hj-np-tv-z]+

Can we enforce a greater control on “1 or more”? The * and + are fine when we do not care about how many instances might occur, but we might want to have a restriction. For instance, there must be no more than five, or there must be at least two. Could we accomplish this using some combination of the ? metacharacter? For instance, to indicate “no more than five”, could we use “?????”? Unfortunately, we cannot combine question marks in this way. The first question mark applies to a character, but the rest of the question marks apply to the preceding question mark.

We could, however, place a character (the period for “any character”) followed by a question, and repeat this five times as in:

.?.?.?.?.?

This regular expression applies each question mark to a period. And since the question mark means “0 or 1”, this is the same as saying anywhere from zero to five characters. But how do we force the characters to be the same character? Instead, what about [[:graph:]]?[[:graph:]]?[[:graph:]]?[[:graph:]]?[[:graph:]]? Unfortunately, as with the period, the character in each [[:graph:]] can be any visible character, but not necessarily the same character.

Our solution instead is to use another metacharacter, in this case, {n,m}. Here, n and m are both positive integers with n less than m. This notation states that the preceding character will match between n and m occurrences. That is, it is saying “match at least n but no more than m of the preceding character.” You can omit either bound to enforce “at least” and “no more than”, or you can specify a single value to enforce “exactly”.

For instance, 0{1,5}1* would mean “between one and five 0s followed by any number of 1s”, whereas [01]{1,5} means “between one and five combinations of 0 and 1”. In this latter case, we would not care what order the 0s and 1s occur in. Therefore, this latter expression will match 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101, 110, 111, up to five total characters.

We would use {2,} to indicate “at least two” and {,5} to indicate “no more than five”. With the use of {n,m}, we can now restrict the number of matches to a finite number. Consider 0{5}1{5}. This would match only 0000011111. However, [01]{5} would match any combination of five 0s and 1s.
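Since {n,m} makes the set of matches finite, the matches can be counted. The sketch below does this with Python's re module, which accepts the same brace notation (using Python instead of grep is an assumption of this example, and the helper binary_strings is hypothetical).

```python
import re
from itertools import product

def binary_strings(max_len):
    """All strings of 0s and 1s with length 1..max_len."""
    return ["".join(p) for n in range(1, max_len + 1) for p in product("01", repeat=n)]

pool = binary_strings(6)
one_to_five = [s for s in pool if re.fullmatch(r"[01]{1,5}", s)]
exactly_five = [s for s in pool if re.fullmatch(r"[01]{5}", s)]
print(len(one_to_five), len(exactly_five))

# Bounded runs of a single character.
print(bool(re.fullmatch(r"0{1,5}1*", "000111")))     # three 0s, then 1s
print(bool(re.fullmatch(r"0{1,5}1*", "0000001")))    # six 0s is one too many
print(bool(re.fullmatch(r"0{2,}", "0")))             # "at least two" rejects one 0
print(bool(re.fullmatch(r"0{,5}", "000000")))        # "no more than five" rejects six
print(bool(re.fullmatch(r"0{5}1{5}", "0000011111")))
```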

It should be noted that the use of {} is only available when you are using the extended regular expression set. The program grep, by default, does not use the extended set of metacharacters. To use {} in grep, you would have to use extended grep. This is either egrep or grep -E. We will see this in more detail in The grep Program.

Let us combine all of the ideas we have seen to this point to write a regular expression that will match a social security number. The social security number is of the form ###-##-####, where each # is a digit. A regular expression to match such a number is given below:

[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]

This regular expression requires a digit, a digit, a digit, a hyphen, a digit, a digit, a hyphen, a digit, a digit, a digit, and a digit. Using {n}, we can shorten the expression to:

[0-9]{3}-[0-9]{2}-[0-9]{4}
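The long form (three digits, a hyphen, two digits, a hyphen, four digits) and the {n} form should accept exactly the same strings. A minimal sketch in Python's re module (assumed here as a substitute for grep -E) compares them on a few made-up samples; the constant names are hypothetical.

```python
import re

SSN_LONG = r"[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]"
SSN_SHORT = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

samples = ["123-45-6789", "123456789", "12-345-6789", "123-45-678", "abc-de-fghi"]
long_hits = [s for s in samples if re.fullmatch(SSN_LONG, s)]
short_hits = [s for s in samples if re.fullmatch(SSN_SHORT, s)]
print(long_hits)
print(short_hits)
```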

What would a phone number look like? That depends on whether we want an expression that will match a phone number with an area code, without an area code, or one that could match a phone number whether there is an area code or not. We will hold off on answering this question for now and revisit it at the end of this section.

Let us try something else. How would we match an IP address? An IP address is of the form 0-255.0-255.0-255.0-255. The following regular expression is not correct. Can you figure out why?

[0-255].[0-255].[0-255].[0-255]

The IP address regular expression has two flaws; the first one might be obvious, the second is a bit more obscure. What does [0-255] mean? In a regular expression, you use the [] to indicate a list of choices. Choices can either be an enumerated list, such as [abc], or a range, such as [a-c]. The bracketed lists for this regular expression contain both a range and an enumerated list. First, there is the range 0–2, which will match 0, 1, or 2. Second, there is an enumerated list 5, 5. Thus, each of the bracketed items will match any of 0, 1, 2, 5, or 5. So the above regular expression would match 0.1.2.5 or 5.5.5.5 or 0.5.1.2. What it would not match are either 0.1.2.3 (there is no 3 in the brackets) or 10.11.12.13 (all of the brackets indicate a single digit, not a multicharacter value such as 13).

So how do we specify the proper enumerated list? Could we enumerate every number from 0 to 255? Not easily, and we would not want to, the list would contain 256 numbers! How about the following: [0-9]{1,3}. This expression can match any single digit from 0 to 9, any two digit numbers from 00 to 99, and any three-digit numbers from 000 to 999. Unfortunately, we do not have an easy way to limit the three-digit numbers to being 255 or less, so this would match a string such as 299.388.477.566. But for now, we will use this notation. So, let us rewrite our expression as

[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}

The second flaw in our original regular expression is the use of the . (period). Recall that . means “match any one character”. If we use a period in our regular expression, it could match anything. So, our new regular expression could match 1.2.3.4 or 10.20.30.40 or 100.101.201.225, but it could also match 1a2b3c4 or 1122334 or 1-2-3-4, and many other sequences that are not IP addresses. Our problem is that we do not want the period to be considered a metacharacter; we want the period to be treated literally.

How do we specify a literal character? For most characters, to treat it literally, we just list it. For instance, abc is considered the literal string “abc”. But if the character itself is a metacharacter, we have to do something special to it. There are two possibilities. The first is to place it in []. This is fine, although it is not common to place a single character in [], so [0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3} would work. Instead, when we want to specify a character that so happens to be one of the metacharacters, we have to “escape” its meaning. This is done by preceding the character with a \, as in \. or \+ or \{ or \$. So, our final answer (for now) is

[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}

We will try to fix the flaw that permits three-digit numbers greater than 255 in a little while.
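Both flaws can be observed directly. The sketch below, written with Python's re module as an assumed stand-in for grep -E, first lists which digits [0-255] accepts and then compares the pattern with an unescaped period against the one with an escaped period. The constant names are hypothetical.

```python
import re

# [0-255] is the range 0-2 plus the single characters 5 and 5.
bracket_chars = "".join(c for c in "0123456789" if re.fullmatch(r"[0-255]", c))
print(bracket_chars)

UNESCAPED = r"[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}"
ESCAPED = r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"

samples = ["1.2.3.4", "10.20.30.40", "1a2b3c4", "1122334", "1-2-3-4", "299.388.477.566"]
unescaped_hits = [s for s in samples if re.fullmatch(UNESCAPED, s)]
escaped_hits = [s for s in samples if re.fullmatch(ESCAPED, s)]
print(unescaped_hits)
print(escaped_hits)
```

Escaping the period removes the non-address strings, but 299.388.477.566 still matches because [0-9]{1,3} places no upper limit of 255 on each number.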

There are other uses of the escape character; these are called escape sequences. Table 10.3 provides a listing of common escape sequences. For instance, if you want to find four consecutive white spaces, you might use \s{4}. Or if you want to match any sequence of 1 to 10 non-digits, you could specify \D{1,10}.

Table 10.3

Common Escape Sequences

Sequence    Meaning
\d          Match any digit
\D          Match any non-digit
\s          Match any white space
\S          Match any non-white space
\w          Match any letter (a-z, A-Z), digit, or underscore
\W          Match any character that is not a letter, digit, or underscore
\b          Match a word boundary
\B          Match any non-word boundary
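A few of these escape sequences are illustrated below using Python's re module, which supports all of the sequences in Table 10.3. This is only an illustration in Python; which sequences a given grep accepts may depend on its version and options, so treat the transfer to grep as an assumption to verify.

```python
import re

# \s{4}: four consecutive whitespace characters
four_spaces = re.search(r"\s{4}", "name    value") is not None

# \D{1,10}: runs of 1 to 10 non-digits
non_digit_runs = re.findall(r"\D{1,10}", "abc123def")

# \b: cat must stand as a word of its own
whole_word = re.search(r"\bcat\b", "the cat sat") is not None
inside_word = re.search(r"\bcat\b", "concatenate") is not None

print(four_spaces, non_digit_runs, whole_word, inside_word)
```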

The [] has another usage, although it can be a challenge to apply correctly. If you place a ^ before the enumerated list in the [], the list is now interpreted as meaning “match if none of these characters are present”. You might use [^0-9]* to match against a string that contains no digits or [^A-Z]* to match a string that contains no upper-case letters. The expression [A-Z][a-z]+[^A-Z]* states that a string is expected to have an upper-case letter, some number of lower-case letters but no additional upper-case letters. This might be the case if we expect a person’s name, as we would not expect, for instance, to see a name spelled as ZaPPa.

Why is [^…] challenging to use? To explain this, we must first reexamine what regular expressions match. Remember that a regular expression is a string used to match against another string. In fact, what the regular expression will match is a substring of a larger string. A substring is merely a portion of a string. For instance, if the string is “Frank Zappa”, any of the following would be considered a substring: “Frank”, “ank”, “Zappa”, “k Z”, “ppa”, “Frank Zappa”, and even “” (the empty string).

Consider the expression 0{1,2}[a-zA-Z0-9]+. This regular expression will match any string that consists of one or two 0s followed by any combination of letters and digits. Now consider the following string:

0000abcd0000

As we have defined regular expressions earlier, this string should not match the expression because it does not have “one or two 0s”, it literally has four 0s. However, the expression is only looking to match any substring of the string that has “one or two 0s followed by letters and digits”. Since the string contains “0a”, it matches. The regular expression does not need to match every character in the string; it only has to find some substring that does match.

Returning to the usage of [^…], let us look at an example. Consider [^A-Z]+. The meaning of this expression seems clear: match anything that does not contain capital letters. Now consider a string abCDefg. It would appear that the regular expression should not match this string. But the regular expression in fact says “do not match upper-case letters” but the string also contains lower-case letters. Therefore, the regular expression provided will match ab and efg from the string, so the string is found to match. What use is [^A-Z]+ then if it matches a string that contains upper-case letters? The one type of string this regular expression will not match is any string that only consists of upper-case letters. So, although it matches abCDefg, it would not match ABCDEFG. To make full use of [^…], we have to be very careful.
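The difference between matching a substring and matching the whole string can be seen in a short sketch. In Python's re module (used here in place of grep, as an assumption of the example), re.search looks for a matching substring, much as described above, while re.fullmatch insists that the whole string fit the pattern.

```python
import re

NOT_UPPER = r"[^A-Z]+"

# The substrings of abCDefg that contain no upper-case letters.
pieces = re.findall(NOT_UPPER, "abCDefg")
print(pieces)

# A string of only upper-case letters offers no such substring.
all_caps = re.search(NOT_UPPER, "ABCDEFG")
print(all_caps)

# Requiring the whole string to fit rejects abCDefg.
whole_only = re.fullmatch(NOT_UPPER, "abCDefg")
print(whole_only)

# 0000abcd0000 contains a substring of one or two 0s followed by letters and digits.
zeros = re.search(r"0{1,2}[a-zA-Z0-9]+", "0000abcd0000")
print(zeros is not None)
```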

With this in mind, consider 0+1+. This will match 0001, 01111, 01, but it will also match 0101, and 000111aaa because these two strings do contain a sequence that matches 0+1+. So how can we enforce that the match should only precisely match the expression? For this, we need two additional metacharacters, ^ and $. The ^, as seen above, can be used inside of [] to mean “match if these are not found”. But outside of the [], the ^ means to “match at the start of the string” and $ means to “match at the end of the string”. If our regular expression is of the form ^expression$, it means to match only if the string is precisely of this format.

We might want to match any strings that start with numbers. This could be accomplished through the regular expression ^[0-9]+. We use + instead of *, because * could match “none of these”, so it would match strings that do or do not start with digits. We might want to match any strings that end with a state abbreviation. All state abbreviations are two upper-case letters, so this would look like [A-Z][A-Z]$, or alternatively [A-Z]{2}$. If we wanted the state abbreviation to end with a period, we could use [A-Z][A-Z]\.$ and if we wanted to make the period optional, we could use [A-Z][A-Z]\.?$. Notice that using [A-Z][A-Z], we are also matching any two upper-case letters, so for instance AB and ZZ, which are not legal state abbreviations.
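The effect of ^ and $ can be checked with a small sketch in Python's re module (assumed as a stand-in for grep; the sample strings are made up for illustration). re.search finds a matching substring, so the anchors are what force the match to the start or end of the string.

```python
import re

samples = ["0001", "01111", "01", "0101", "000111aaa"]
loose = [s for s in samples if re.search(r"0+1+", s)]      # any matching substring
strict = [s for s in samples if re.search(r"^0+1+$", s)]   # the whole string must fit
print(loose)
print(strict)

starts_with_digits = [s for s in ["42 Main", "Main 42"] if re.search(r"^[0-9]+", s)]
print(starts_with_digits)

# Two upper-case letters and an optional literal period at the end of the string.
STATE_AT_END = r"[A-Z][A-Z]\.?$"
endings = ["Cincinnati, OH", "Cincinnati, OH.", "OH 45202", "Dayton, ZZ"]
ends_with_state = [s for s in endings if re.search(STATE_AT_END, s)]
print(ends_with_state)
```

The last list includes "Dayton, ZZ", which shows the limitation noted above: the pattern checks the shape of an abbreviation, not whether it is a real state.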

In general, we do not want to use both ^ and $ in an expression because it would overly restrict matching. To demonstrate the concepts covered so far, let us consider a file containing employee information. The information is, row by row, each employee’s last name, first name, position, year of hire, office number, and home address.

Let us write a regular expression to find all employees hired in a specific year, say 2010. Our regular expression could just be 2010. However, just using 2010 could lead to erroneous matches because the string 2010 could appear as a person’s office number or as part of an address (street address, zip code).

What if we want to find employees hired since 2000? A regular expression for this could be 20[01][0-9]. Again, this could match as part of an office number or address. If we use ^20[01][0-9]$ as our solution, we restrict matches to lines that consist solely of four-digit numbers between 2000 and 2019. No lines will match because no lines contain only a four-digit number. We must be more clever than this.

Notice that the year hired follows a last name, first name, and position. We could represent last name as [A-Z][a-z]+. We could represent first name as [A-Z][a-z]* (assuming that we permit a letter such as ‘J’ for a first name). We could represent a position as [A-Za-z]+, that is, any combination of letters. Finally, we would expect the year of hire, which in this case should be 20[01][0-9]. If each of the pieces of information about the employee is separated by commas, we can then construct our expression so that each part ends with a comma. See Figure 10.1. Notice that year of hire, as shown here, includes years in the future (up through 2019).

Figure 10.1

Representing “Year Hired Since 2000.”
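The pieces described in the text can be assembled and tried on sample rows. This is a sketch under stated assumptions, not a reproduction of the figure: it assumes fields are separated by a comma and a single space, the two employee rows are invented for illustration, and Python's re module is used in place of grep. The constant name HIRED_SINCE_2000 is hypothetical.

```python
import re

# Assumed layout: last name, first name, position, year of hire, each followed by ", ".
HIRED_SINCE_2000 = r"[A-Z][a-z]+, [A-Z][a-z]*, [A-Za-z]+, 20[01][0-9],"

# Hypothetical rows; the second employee was hired in 1998 but sits in office 2010.
rows = [
    "Zappa, Frank, Musician, 2005, 2010, 901 Pine Street, Cincinnati, OH, 45202",
    "Smith, J, Manager, 1998, 2010, 50 N. 10th Street, Dayton, OH, 45402",
]
recent = [r for r in rows if re.search(HIRED_SINCE_2000, r)]
for r in recent:
    print(r)
```

Searching for 2010 alone would have returned both rows; tying the year to the three fields that precede it returns only the first.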

We could also solve this search problem from the other end of the row. The year hired will occur before an office number and an address. If we assume an office number will only be digits, and that an address will be a street address, city, state, and zip code, then to end the line, we would expect to see 20[01][0-9], [0-9]+, and an address. The address is tricky because it consists of several different components: a street address, a city, a state abbreviation (two letters), and a zip code. The street address itself might be digits followed by letters and spaces (e.g., 901 Pine Street) or it might be more involved. For instance, there may be periods appearing in the address (e.g., 4315 E. Magnolia Road), or it might include digits in the street name (e.g., 50 N. 10th Street). It could also have an apartment number that uses the # symbol (such as 242 Olive Blvd, #6). We could attempt to tackle all of these by placing every potential letter, digit, symbol, and space in one enumerated list, as in [A-Za-z0-9.# ]+ (note the blank space before the closing bracket). The city name should just be letters (although it could also include spaces and periods) but we will assume just letters. The state is a two-letter abbreviation in upper-case letters, and the zip code should be a five-digit number (although it could also be in the form 12345-6789). Figure 10.2 represents the entire regular expression that could be used to end a string that starts with a year of hire of at least 2000, but only with a five-digit zip code (we see how to handle the nine-digit zip code below).

Figure 10.2

A possible regular expression to match a street address.

The last metacharacters are | and (). The use of these metacharacters is to provide sequences of characters where any single sequence can match. That is, we can enumerate a list of OR items. Unlike the enumerated list in [], here, we are enumerating sequences. Consider that we want to match any of these three 2-letter abbreviations: IN, KY, OH. If we used [IKO][NYH], it would match IN, KY, and OH, but it would also match IY, IH, KN, KH, ON, and OY. We enumerate sequences inside () and place | between the options. So, (IN|KY|OH) literally will only match one of these three sequences.

With the use of () and |, we can provide an expression to match both five-digit and nine-digit zip codes. The five-digit expression is [0-9]{5}. The nine-digit expression is [0-9]{5}-[0-9]{4}. We combine them using | and place them both in (). This gives us ([0-9]{5}|[0-9]{5}-[0-9]{4}).
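The difference between a bracketed list and an enumerated group, and the zip code expression, can both be tested with a short sketch. Python's re module is used here as an assumed stand-in for grep -E, and the constant name ZIP is hypothetical.

```python
import re
from itertools import product

# All nine two-letter pairs built from I, K, O and N, Y, H.
pairs = ["".join(p) for p in product("IKO", "NYH")]
bracket_hits = [p for p in pairs if re.fullmatch(r"[IKO][NYH]", p)]
group_hits = [p for p in pairs if re.fullmatch(r"(IN|KY|OH)", p)]
print(len(bracket_hits), group_hits)

# Five-digit or nine-digit zip code.
ZIP = r"([0-9]{5}|[0-9]{5}-[0-9]{4})"
zip_hits = [z for z in ["45202", "45202-1234", "4520", "45202-12", "452021234"]
            if re.fullmatch(ZIP, z)]
print(zip_hits)
```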

Similarly, we can use () and | to solve the IP address problem from earlier. Recall that our solution using [0-9]{1,3} would match three-digit numbers greater than 255, and thus would match things that were not IP addresses. We would not want to enumerate all 256 possible values, as in (0|1|2|3|…|254|255), but we could use () and | in another way. Consider that [0-9] would match any one-digit number, and IP addresses can include any one-digit number. We could also use [0-9][0-9] because any sequence from 00 to 99 is legitimate within an IP address, although we would not typically use a leading zero. So, we could simplify this as [0-9]{1,2}. However, this does not include the values 100–255. We could express 100–255 as several different possibilities:

1[0-9][0-9]—this covers 100–199
2[0-4][0-9]—this covers 200–249
25[0-5]—this covers 250–255

Figure 10.3 puts these options together into one lengthy regular expression (blank spaces are inserted around the “|” symbols to make it slightly more readable).

Figure 10.3

A solution to match legal IP addresses.
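The figure itself is not reproduced in this text. One way to assemble the pieces described above is sketched below in Python, whose re module treats these metacharacters the same way egrep does. The names OCTET, IP, and is_ip are illustrative, and the exact layout of the expression in Figure 10.3 may differ.

```python
import re

# One octet: 250-255, 200-249, 100-199, or any one- or two-digit number.
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[0-9]{1,2})"

# Four octets separated by literal periods.
IP = re.compile(OCTET + r"\." + OCTET + r"\." + OCTET + r"\." + OCTET)

def is_ip(text):
    """Return True when the whole string is four dot-separated octets in 0-255."""
    return IP.fullmatch(text) is not None

for candidate in ["10.11.12.13", "255.255.255.0", "256.1.1.1", "999.999.999.999"]:
    print(candidate, is_ip(candidate))
```

The last two candidates are rejected, which the simpler [0-9]{1,3} version could not do.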

Let us wrap up this section with a number of examples. First, we will describe some strings that we want to match and come up with the regular expressions to match them. Second, we will have some regular expressions and try to explain what they match. Assume that we have a text file that lists student information of the following form:

Student ID (a 16 digit number), last name, first name, major, minor, address.

Majors and minors will be three-letter codes (e.g., CSC, CIT, BIS, MIN) all in capital letters. A minor is required, but minors can include three blank spaces to indicate “no minor selected”. The address will be street address, city, state, zip code.

We want to find all students who have majors of either computer science (CSC) or computer information technology (CIT). The obvious answer is to just search for (CSC|CIT). However, this does not differentiate between major and minor, and we only want majors. Notice that the major follows the 16-digit number and the name. So, for instance, we might expect to see 0123456789012345, Zappa, Frank, CSC, … We could then write the regular expression starting at the beginning of the line:

^[0-9]{16}, [A-Z][a-z]+, [A-Z][a-z]+, (CSC|CIT)

We could shorten this. Since the minor is always preceded by a major, which is a capitalized three-letter block, what we expect to see before the major, but not the minor, is a first name, which is not fully capitalized. So we could reduce the expression as follows:

[A-Z][a-z]+, (CSC|CIT)

Here, we are requiring a sequence of an upper-case letter followed by lower-case letters (a name) followed by a comma followed by one of CSC or CIT. If the CSC or CIT matched the minor, the string preceding it would not include lower-case letters, and if the upper-case letter followed by lower-case letters matched a last name or street address, it would not be followed by either CSC or CIT.
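To see how the three candidate expressions behave, here is a small sketch in Python (standing in for egrep). The three records are invented for illustration in the format described above, and the helper grep simply keeps the lines in which the pattern is found.

```python
import re

# Invented sample records: ID, last name, first name, major, minor, address.
records = [
    "0123456789012345, Zappa, Frank, CSC, MIN, 1 Main St, Highland Heights, KY, 41099",
    "0123456789012346, Fox, Richard, BIS, CIT, 22 Oak Ave, Cincinnati, OH, 45202",
    "0123456789012347, Newman, Paul, CIT,    , 9 Elm Rd, Florence, KY, 41042",
]

NAIVE = r"(CSC|CIT)"
LONG = r"^[0-9]{16}, [A-Z][a-z]+, [A-Z][a-z]+, (CSC|CIT)"
SHORT = r"[A-Z][a-z]+, (CSC|CIT)"

def grep(pattern, lines):
    """Return the lines that contain a match for the pattern, as grep would."""
    return [line for line in lines if re.search(pattern, line)]

for name, pattern in [("naive", NAIVE), ("long", LONG), ("short", SHORT)]:
    print(name, len(grep(pattern, records)))
```

The naive pattern returns all three records because the second one has CIT as a minor. The long and short patterns both return only the first and third.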

We want to find all students who live in apartments. We make the assumption that apartments are listed in the address portion using either apt, apt., #, or apartment. We can use [.]? (or \.?) to indicate that the period after apt is optional. We can enumerate these as follows:

([Aa]pt[.]?|[Aa]partment|#)

We want to find all students who live in either Kentucky or Ohio. We will assume the state portion of the address is abbreviated. That would allow us to simply specify:

(KY|OH)

We would not expect KY or OH to appear anywhere else in an address. However, it is possible that a major or minor might use KY or OH as part of the three-letter abbreviation. If we wanted to play safe about this, we would assume that the state appears immediately before a zip code, which ends the string. For this, we could specify:

(KY|OH), [0-9]{5}$

or if we believe there will be a five-digit zip code and a four-digit extension, we could use:

(KY|OH), ([0-9]{5}|[0-9]{5}-[0-9]{4})$

Alternatively, if the state abbreviation always appears after a blank and ends with a period, we could add the blank and the period to the expression so that three-letter majors and minors will not match:

( KY\.| OH\.)

If we had a major, say SKY or OHM, it would not match these state abbreviations because of the space and period that surround the two letters.

Spam Filters and Regular Expressions

So you want to build a spam filter to filter out unwanted e-mail. It is a simple task to write a program that will search through an e-mail (text file) for certain words: “Cash”, “Act Now!”, “Lose Weight”, “Viagra”, “Work at home”, “You’ve been selected”, and so forth. But spammers have fought back by attempting to disguise keywords.

Consider a spam e-mail advertising cheap and available Viagra tablets. The e-mail may attempt to disguise the word Viagra under a number of different forms: V1agra, V.i.a.g.r.a, Vi@ gra, V!agra, ViaSEXYgra, and so forth. But regular expressions can come to our rescue here.

If we are only worried about letter replacements, we could try to enumerate all possible replacements in our regular expression, as in

[Vv][iI1!][Aa@][gG9][Rr][aA@]

What about a version where, rather than a common replacement (for instance, 3 for ‘e’ or 1 or ! for ‘i’), the replacement character is unexpected? For instance, V##gra or Viag^^? Here, we have to be more careful with our regular expression. We could, for instance, try [Vv].{2}gra, [Vv]iag.{2}, and other variants, but now we have to be careful not to block legitimate words. For instance, [Vv]ia.{3} could match viable.

What about a version of the word in which additional characters are inserted, like V.i.a.g.r.a., ViaSEXYgra, or Via##gra? To tackle this problem, we can insert the notation .* in between letters. Recall that . means “any character” and * means “0 or more of them”. So, V.*i.*a.*g.*r.*a.* would match a string that contains the letters ‘V’, ‘i’, ‘a’, ‘g’, ‘r’, ‘a’ no matter what might appear between them. Because of the *, we can also match strings where there is nothing between those letters.

Without regular expressions, spam filters would be far less successful. But, as you can see, defining the regular expressions for your filter can be challenging!
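The trade-off described above can be observed directly. The sketch below uses Python's re module in place of a real spam filter; the subject lines are invented, and the character-class pattern includes 1 among the stand-ins for i.

```python
import re

# Letter substitutions: each position lists the plain letter and its stand-ins.
SUBSTITUTIONS = re.compile(r"[Vv][iI1!][Aa@][gG9][Rr][aA@]")

# Insertions: anything at all may appear between the letters.
INSERTIONS = re.compile(r"V.*i.*a.*g.*r.*a.*")

subjects = [
    "Cheap V1agra today",
    "ViaSEXYgra for less",
    "V.i.a.g.r.a",
    "Visit a garage sale",   # legitimate text
]

for subject in subjects:
    print(subject,
          bool(SUBSTITUTIONS.search(subject)),
          bool(INSERTIONS.search(subject)))
```

The insertion pattern catches the disguised forms that the substitution pattern misses, but it also flags the harmless last line, since the letters V, i, a, g, r, a happen to appear there in order.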

We want to find all students whose zip codes start with 41. This one is simple:

41[0-9]{3}

However, this could also match part of a student number (for instance, 0123412345678901), or it is possible that it could match someone’s street address. Since we expect the zip code to end the string, we can remedy this with

41[0-9]{3}$

Or if we might expect the four-digit extension

(41[0-9]{3}|41[0-9]{3}-[0-9]{4})$
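A short check of the unanchored and anchored versions, sketched in Python with invented records (the first has a student ID that happens to contain 41234):

```python
import re

# Invented sample records in the format described earlier.
records = [
    "0123412345678901, Zappa, Frank, CSC, MIN, 1 Main St, Cincinnati, OH, 45202",
    "0123456789012345, Fox, Richard, CIT, BIS, 22 Oak Ave, Highland Heights, KY, 41099",
    "0123456789012346, Newman, Paul, BIS,    , 9 Elm Rd, Florence, KY, 41042-1234",
]

UNANCHORED = r"41[0-9]{3}"
ANCHORED = r"(41[0-9]{3}|41[0-9]{3}-[0-9]{4})$"

def grep(pattern, lines):
    """Return the lines that contain a match for the pattern."""
    return [line for line in lines if re.search(pattern, line)]

print(len(grep(UNANCHORED, records)))  # every record, including the Ohio one
print(len(grep(ANCHORED, records)))    # only the two 41xxx zip codes
```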

We want to find any student whose ID ends in an even number. We could not just use [02468] because that would match any string that contains any of those digits anywhere (in the ID number, in the street address, in the zip code). But student numbers are always 16 digits, so we want only the 16th digit to be one of those numbers.

[0-9]{15}[02468]

We could precede the expression with ^ to ensure that we are only going to match against student IDs (in the unlikely event that a student’s street address has 16 or more digits in it!).

One final example. We want to find students whose last name is Zappa. This seems simple:

Zappa

However, what if Zappa appears in the person’s first name or address? Unlikely, but possible. The last name always appears after the student ID. So we can ensure this by matching the student ID as well: ^[0-9]{16}, Zappa

Let us look at this from the other side. Here are some regular expressions. What will they match against?

([^O][^H], [0-9]{5}$|[^O][^H], [0-9]{5}-[0-9]{4}$)

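We match any line in which the two characters before the final comma are not an O followed by an H, and in which a five- or nine-digit number ends the string. The intent is to match any student who is not from OH. Note, however, that [^O][^H] rejects every abbreviation that starts with O or ends with H, so students from OK or NH, for example, would be missed as well.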
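The behavior of [^O][^H] is easy to misjudge, so a quick experiment helps. This sketch uses Python's re module and four invented address endings. The second pattern, POSITIVE, is an illustrative alternative in which we match OH and then keep the lines that did not match.

```python
import re

# Invented address endings: street, city, state, zip code.
lines = [
    "1 Main St, Cincinnati, OH, 45202",
    "22 Oak Ave, Highland Heights, KY, 41099-1234",
    "9 Elm Rd, Concord, NH, 03301",
    "5 Pine St, Tulsa, OK, 74103",
]

PATTERN = r"([^O][^H], [0-9]{5}$|[^O][^H], [0-9]{5}-[0-9]{4}$)"
POSITIVE = r"OH, [0-9]{5}(-[0-9]{4})?$"

matched = [line for line in lines if re.search(PATTERN, line)]
not_ohio = [line for line in lines if not re.search(POSITIVE, line)]

print(matched)    # only the KY line
print(not_ohio)   # the KY, NH, and OK lines
```

Each bracket excludes a single character at a single position, so the first pattern drops NH (because of the H) and OK (because of the O) along with OH.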

[A-Z]{3},[ ]{4},

This string will find three upper-case letters followed by a comma followed by four blanks (the blank after the comma, and then no minor, so three additional blanks), followed by a comma. We only expect to see three upper-case letters for a major or a minor. So here, we are looking for students who have a major but no minor.

^0{15}[0-9]

Here, we are looking for a string that starts with 15 zeroes followed by any single digit. If student ID numbers were assigned in order, this would identify the first 10 students.

Bash and Wildcards

Recall from Chapter 9 that the Bash interpreter performs multiple steps when executing an instruction. Among these steps are a variety of expansions. One of note is filename expansion. If a user specifies a wildcard in the filename (or pathname), Bash unfolds the wildcard into a list of matching files or directories. This list is then passed along to the program. For instance, with ls *.txt, the ls command does not perform the unfolding, but instead Bash unfolds *.txt into the list of all files that match and then provides the entire list to ls. The ls command, like many Linux commands, can operate on a single item or a list equally.

Unfortunately, the characters used to denote the wildcards in Bash are the same as some of the regular expression metacharacters. This can lead to confusion especially since most users learn the wildcards before they learn regular expressions. Since you learned the wildcards first, you would probably interpret the * in ls *.txt as “match anything”. Therefore, ls *.txt would list all files that end with the .txt extension. The idea of collecting matches of a wildcard is called globbing. Recall that expansion takes place in a Bash command before the command’s execution. Thus, ls *.txt first requires that the * be unfolded. The unfolding action causes the Bash interpreter to enumerate all matches. With *.txt, the matches are all files whose names end with the characters “.txt”. Note that the period here is literally a period (because the period is not used in Bash as a wildcard). Therefore, the ls command literally receives a list of all filenames that match the pattern.

As discussed in Metacharacters, the * metacharacter means “match the preceding character(s) zero or more times.” In the case of ls *.txt, the preceding character is a blank space. That would be the wrong interpretation. Therefore, as a Linux user or administrator, you must be able to distinguish the usage of wildcard characters as used in Bash to perform globbing from how they are used as regular expressions in a program such as grep.
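The difference between the two readings of * can be demonstrated by applying one pattern both ways. The sketch below is only an approximation of Bash and grep: it uses Python's fnmatch module for the wildcard reading and the re module for the regular expression reading, and the four names are invented. Whole names are matched in both cases to keep the comparison fair.

```python
import re
from fnmatch import fnmatchcase

names = ["foo.txt", "fox.txt", "f.txt", "fxtxt"]
pattern = "fo*.txt"

# Wildcard reading: "fo", then anything, then a literal ".txt".
as_wildcard = [n for n in names if fnmatchcase(n, pattern)]

# Regular expression reading: "f", zero or more "o", any one character, "txt".
as_regex = [n for n in names if re.fullmatch(pattern, n)]

print(as_wildcard)
print(as_regex)
```

As a wildcard, the pattern accepts foo.txt and fox.txt. As a regular expression, it accepts foo.txt, f.txt, and fxtxt but not fox.txt.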

The wildcard characters used in Bash are *, ?, +, @, !, \, [], [^…], and [[…]]. Some of these are included in the regular expression metacharacter set and some are not. We will examine the usage of the common wildcards in this section. Table 10.4 provides an explanation for each. Note that those marked with an a in the table are wildcards that are only available if you have set Bash up to use the extended set of pattern matching symbols. As we will assume this has not been set up in your Bash environment, we will not look at those symbols although their meanings should be clear.

Table 10.4

Bash Wildcard Characters

Wildcard       Meaning
*              Matches any string, including the null string
**             Matches all files and directories
**/            Matches directories
?              Matches any single character (note: does not match 0 characters)
+              Matches one or more occurrences (similar to how it is used in regular expressions) (a)
@              Matches any one of the listed patterns (a)
!              Matches anything except one of the listed patterns (a)
\              Used to escape the meaning of the given character as with regular expressions; for instance, \* means to match against an *
[…]            Matches any of the enclosed characters; ranges are permitted when using a hyphen; if the first character after the [ is either a ! or ^, it matches any character that is not enclosed in the brackets
{…}            As with brace expansion in Bash, lists can be placed in {} to indicate “collect all”, as in ls {c,h}*.txt, which would find all txt files starting with a c or h
[[:class:]]    As with regular expressions, matches any character in the specified class

(a) Only available if you have set Bash up to use the extended set of pattern matching symbols.

We finish this section with some examples that use several of the wildcard symbols from Table 10.4. We will omit the wildcards that are from the extended set. For this example, assume that we have the following files and subdirectories in the current working directory. Subdirectories are indicated with a / before their name.

foo	foo.txt	foo1.txt	foo2.dat	foo11.txt	/foo3	/fox	/foreign
/foxr	FOO	FOO.txt	FOO1.dat	FOO11.txt	foo5?.txt	/FOO4

See Table 10.5, which contains each example. The table shows the Linux ls command and the items from the directory that would be returned.

Table 10.5

Examples

ls Command              Items Returned
ls *.txt                foo.txt, foo1.txt, foo11.txt, FOO.txt, FOO11.txt, foo5?.txt
ls *.*                  foo.txt, foo1.txt, foo2.dat, foo11.txt, FOO.txt, FOO1.dat, FOO11.txt, foo5?.txt
ls *                    Will list all items in the directory
ls foo?.*               foo1.txt, foo2.dat
ls foo??.*              foo11.txt, foo5?.txt
ls *\?.*                foo5?.txt
ls *.{dat,txt}          Will list all items in the directory that end with either .txt or .dat
ls foo[0-2].*           foo1.txt, foo2.dat
ls *[[:upper:]]*.txt    FOO11.txt, FOO.txt
ls *[[:upper:]]*        FOO, FOO11.txt, FOO1.dat, FOO.txt, /FOO4
ls *[[:digit:]]*        Will list every item that contains a digit
ls foo[[:digit:]].*     foo1.txt, foo2.dat (it does not list foo11.txt because we are only seeking 1 digit, and it does not list foo5?.txt because we do not provide for the ? after the digit and before the period)
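Several rows of Table 10.5 can be checked without creating the files. The following sketch approximates Bash globbing with Python's fnmatch module, which supports *, ?, and [] but not the [[:class:]] forms or \ escaping, so a literal question mark is written as [?] here. The helper glob_names is illustrative.

```python
from fnmatch import fnmatchcase

# Names from the example directory (the leading / on subdirectories is dropped).
names = ["foo", "foo.txt", "foo1.txt", "foo2.dat", "foo11.txt", "foo3", "fox",
         "foreign", "foxr", "FOO", "FOO.txt", "FOO1.dat", "FOO11.txt",
         "foo5?.txt", "FOO4"]

def glob_names(pattern):
    """Return the names that the wildcard pattern matches, case-sensitively."""
    return [name for name in names if fnmatchcase(name, pattern)]

for pattern in ["*.txt", "foo?.*", "foo??.*", "foo[0-2].*", "*[?].*"]:
    print(pattern, glob_names(pattern))
```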

The grep Program

The grep program searches one or more text files for strings that match a given regular expression. It prints out the lines where such strings are found. In this way, a user or administrator can quickly obtain lines from text files that match a desired pattern. We hinted at this back at the end of the section on Metacharacters when we looked at regular expressions as used to identify specific student records in a file. As grep can operate on multiple files at once, when more than one file is searched grep returns two things for each match, the file name and the line that contained the match. With the -n option, you can also obtain the line number for each match. When you use grep, depending on your regular expression and the strings in the file, the program could return the entire file if every line matches, a few lines of the file, or no lines at all.

The grep program uses the regular metacharacters as covered in Metacharacters. By default, it does not include what is called the extended regular expression set, which includes metacharacters such as {}, |, and (). However, grep has an option, grep -E (or the program egrep), which does use the extended set. So, for the sake of this section, we will use egrep throughout (egrep and grep -E do the same thing).

The grep/egrep program works like this:

grep pattern filename(s)

If you want to use multiple files, you can either use wildcards such as * or ? as described in Bash and Wildcards, or you can list multiple file names separated by spaces. If your pattern includes a blank, you must enclose the pattern in single quote ('') or double quote ("") marks. It is a good habit to always quote your regular expressions as a precaution.

In fact, the use of ‘’ is most preferred. This is because the Bash interpreter already interprets several characters in ways that grep may not. Consider the statement grep !! filename. This statement seems straightforward, search filename for the characters !!. Unfortunately though, !! signals to the Bash interpreter that the last instruction should be recalled. Imagine that instruction was cd ~. Since the Bash interpreter unfolds such items as !! before executing the instruction, the instruction changes to grep cd ~ filename. Thus, grep will search filename for the sequence of characters cd ~.

Another example occurs with the $. You might recall that the $ precedes variable names in Bash. As with !!, the Bash interpreter will replace variables with their values before executing a command. So the command grep $HOME filename will be replaced with grep /home/username filename. To get around these problems, single quoting will cause the Bash interpreter to avoid any unfolding or interpretation. Double quoting will not prevent this problem because "$HOME" is still converted to the value stored in $HOME.

Grep attempts to match a regular expression to each line of the given file. This is perhaps not how we initially envisioned the use of regular expressions since we described a regular expression as matching to strings, not lines. In essence, grep treats each line of a file as a string, and looks to match the regular expression to any substring of the string. Either the pattern matches something on that line or it does not. The grep program will not search each individual string of a file (assuming strings are separated by white space). So, for instance, if a file had the line:

bat bait beat beet bet bit bite boot bout but

and we used the regular expression b.t, since b.t matches at least one item on the line, grep returns the entire line. Had we wanted to match each individual string on the line so that we only received as a response bat, bet, bit, but, we would have to resort to some other tactic.

One approach to using grep on each individual string of a file rather than each individual line would be to run a string tokenizer and pipe the results to grep. A string tokenizer is a program that separates every pair of items that have white space between them. Imagine that we have such a program called tokenizer. We could do

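tokenizer filename | grep '^b.t$'

Notice in such a case, grep does not receive a filename(s), but instead the file information is piped directly to it. The ^ and $ are needed because each token now arrives as its own line, and without them b.t would also match inside a longer token such as bite.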
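The difference between matching lines and matching individual strings can be sketched in Python. Here split() plays the role of the hypothetical tokenizer program, and the re module plays the role of grep.

```python
import re

line = "bat bait beat beet bet bit bite boot bout but"

# Line-at-a-time, as grep works: one hit anywhere returns the whole line.
line_matches = bool(re.search(r"b.t", line))

# Token-at-a-time: split on white space, then test each token.
tokens = line.split()
contains = [t for t in tokens if re.search(r"b.t", t)]    # substring match
exact = [t for t in tokens if re.fullmatch(r"b.t", t)]    # whole-token match

print(line_matches)
print(contains)
print(exact)
```

A substring match on the tokens still returns bite, because bit occurs inside it. Only the whole-token match, the equivalent of writing ^b.t$, yields exactly bat, bet, bit, and but.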

Another thing to keep in mind is the need to use \ to escape the meaning of a character. We covered this earlier (Metacharacters) when we needed to literally look for a character in a string where the character was a metacharacter.

For instance, $ means “end of string”, but if we were looking for a $, we would indicate this as \$. This is true in grep as it is when specifying regular expressions in other settings. However, there are exceptions to the requirement for escaping the meaning (needing the \) in grep. For instance, if a hyphen is sought as one of a list of items, it might look like this: [!@#%&-=<>]. But recall that a hyphen inside of the [] is used to indicate a range; so instead we would have to specify this list as [!@#%&\-=<>]. The \- is used to indicate that the hyphen is sought literally, not to be used as a range. But there is an exception to this requirement. If the hyphen appears at the end of the list, there is no need to use \, so the list could be [!@#%&=<>-]. Placing the hyphen last is also the safer form in grep, because grep's bracket lists follow the POSIX rules, under which a backslash inside [] stands for itself rather than acting as an escape.

There are other instances where the escape character (\) is not needed. One is an expression that includes a $ where the $ is not at the end of the expression. If we intend to use $ to mean “end of string”, grep does not expect to see any characters following it. Therefore, if we have $[0-9]+ (to indicate a dollar amount), grep treats the $ as a true dollar sign and not as the “end of string matching” metacharacter. The same is true of ^ if characters precede it. Finally, most characters lose their metacharacter meaning if found inside of [], so for instance we would not need to do [\$] or [\?] if we were searching for a $ or ?; instead, we could just use [$] or [?].

Let us examine grep (egrep) now. Figure 10.4 illustrates a portion of the results of applying the command egrep -n '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' /etc/*. The regular expression here is the one we developed earlier to obtain four numbers separated by dots, that is, our four octets to obtain IP addresses. This regular expression could potentially return items that are not IP addresses (such as a match to the string 999.999.999.999), but no such strings are found in the /etc directory. Figure 10.4 is not the entire result because the grep instruction returns too many matches. You could pipe the result to less so that you can step through them all, or you could redirect the output to a file to print out or examine over time.

Figure 10.4

A result from grep.

Notice that in the output we see the filename for each match, the line number of the match, and the line itself. You might notice that the /etc/Muttrc file does not actually contain IP addresses; instead, the matches are version numbers for the software (mutt-1.4.2.2). Even if we had used our “correct” regular expression to match IP addresses (from Figure 10.3), we would still have matched the entries in the Muttrc file. We could avoid this by adding a blank space before the IP address regular expression so that the hyphen before 1.4.2.2 would cause those lines to not match.

The command whose results are shown in Figure 10.4 had to be submitted by root. This is because many of the /etc files are not readable by an end user. Had an end user submitted the grep command, many of the same items would be returned, but the following error messages would also be returned:

egrep: securetty: Permission denied
egrep: shadow: Permission denied
egrep: shadow-: Permission denied
egrep: sudoers: Permission denied
egrep: tcsd.conf: Permission denied

among others.

You might also notice a peculiar line before the /etc/resolv.conf lines. It reads “Binary file /etc/prelink.cache matches”. This is informing us that a binary file contained matches. However, because we did not want to see any binary output, we are not shown the actual matches from within that file. We can force grep to output information from binary files (see Table 10.6). The grep program has a number of useful options; some common ones are listed in Table 10.6. Of particular note are -c, -E, -i, -n, and -v. We will discuss -v later.

Table 10.6

grep Options

Option       Meaning
-a           Process a binary file as if it were a text file (this lets you search binary files for specific strings of binary numbers)
-c           Count the number of matching lines and output the total; do not output any matches found
-d read      Used to handle all files of a given directory; use recurse in place of read to read all files of a given directory, and recursively for all subdirectories
-E           Use egrep (allow the extended regular expression set)
-e regex     The regular expression is placed after -e rather than where it normally is positioned in the instruction; this is used to protect the regular expression if it starts with an unusual character, for instance, the hyphen
-h           Suppress the filename from the output
-i           Ignore case (e.g., [a-z] would match any letter whether upper or lower case)
-L           Output only filenames that have no matches; do not output matches
-m NUM       Stop reading a file after NUM matches
-n           Output line numbers
-o           Only output the portion of the line that matched the regular expression
-R, -r       Recursive search (this is the same as -d recurse)
-v           Invert the match, that is, print all lines that do not match the given regular expression

Let us work out some examples. We will use a file of faculty office information to search for different matches. This file, offices.txt, stores for each faculty member their office location (office number and building abbreviation), their last name, and then the platform of computer(s) that they use. Each item is separated by a comma and the platform of computer might be multiple items. The options for platform are PC, Mac, Linux, Unix. Office locations are a three-digit office number, followed by a space, followed by a two- or three-letter building designator, such as ST, MEP, or GH.

Write a grep command to find all entries on the third floor of their building (assuming their three-digit office number will be 3xx).

egrep '3[0-9][0-9]' offices.txt

Write a grep command that will find all entries of faculty who have PC computers.

egrep 'PC' offices.txt

What if there is a building with PC as part of its three-letter abbreviation? This would match. We could try this instead.

egrep 'PC$' offices.txt

This expression will only match lines where PC appears at the end of the line. However, if a faculty member has multiple computers and PC is not listed last, this will miss that person. Consider that a line will look like this: 123 ABC, Fox, PC, Mac. If we want to match PC, it should match after a comma and a space. This will avoid matching a building, for instance, 456 PCA or 789 APC. So our new grep command is

egrep ', PC' offices.txt

Notice that if the faculty member has multiple computers, each is listed as computer, computer, computer, and so forth. Therefore, if we have PC anywhere in the list, it will occur after a comma and a space. But if PC appears in a building, it will either appear after a digit and a space (as in 456 PCA) or after an upper-case letter (as in 789 APC).

Write a grep command that will find all entries of faculty who have more than one computer. Is there any way to include a “counter” in the egrep command? In a way, yes. Recall that we could control the number of matches expected by using {n,m}. But what do we want to actually count? Let us look at two example lines, a faculty member with one computer and a faculty member with more than one.

123 MEP, Newman, Mac
444 GH, Fox, PC, Linux

With one computer, the first entry only has two commas. With two computers, the second entry has three commas. This tells us that we should search for lines that have at least three commas. However, ',{3}' is not an adequate regular expression because that would only match a line in which there were at least three consecutive commas (such as 456 PCA, Zappa, PC,,,Mac). We should permit any characters to appear before each comma. In fact, in looking at our example lines, all we care about is whether there are spaces and letters before the comma (not digits since the only digits occur at the beginning of the line). Our regular expression then is

egrep '([A-Za-z ]+,){3,}' offices.txt

Notice the blank inside the [], which allows the space that follows each comma, and notice the use of the (). The () are required so that the {3,} applies to the entire regular expression (rather than just the preceding character, which is the comma). We could actually simplify this by using the . metacharacter. Recall that . can match any single character. We can add + to indicate one or more of any type of character. Since all we are interested in is finding at least three commas, we can use either '.+, .+, .+,' or we could use '(.+,){3,}'.
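The comma-counting expressions can be compared side by side. This is a sketch in Python (the re module accepts the same {3,} and () notation as egrep); it uses the two example lines plus the line with consecutive commas mentioned above, and the bracket list includes a blank so that the space after each comma is allowed.

```python
import re

offices = [
    "123 MEP, Newman, Mac",
    "444 GH, Fox, PC, Linux",
    "456 PCA, Zappa, PC,,,Mac",
]

def grep(pattern, lines):
    """Return the lines that contain a match for the pattern."""
    return [line for line in lines if re.search(pattern, line)]

print(grep(r",{3}", offices))                       # only the ",,," line
print(grep(r"([A-Za-z ]+,){3,}", offices))          # letters and blanks before each comma
print(grep(r"(.+,){3,}", offices))                  # anything before each comma
```

The one-computer line has only two commas, so neither of the grouped patterns returns it.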

Write a grep command that will find all entries of faculty who have either a Linux or Unix machine. Here, we can use the OR option, as in (Linux|Unix). We could also try to spell this out using [] options. This would look like this: [L]?[iU]n[ui]x. The two grep commands are:

egrep '(Linux|Unix)' offices.txt
egrep '[L]?[iU]n[ui]x' offices.txt

Write a grep command that will find all entries of faculty who do not have a PC in their office. The obvious solution would be to use [^P][^C] in the expression. That is, find a string that does not have a P followed by C. Sadly, this expression in egrep would return every line of the file. Why? The regular expression does not ask for a line that lacks PC. It asks for any two consecutive characters in which the first is not P and the second is not C, and grep reports a line as soon as any part of it matches. If the line was 444 GH, Fox, PC, Linux, this will match because there are characters on the line that are not PC, for instance, the ‘44’ that starts the string. What we really want to have is a regular expression that reads “some stuff followed by no P followed by no C followed by some other stuff.” Creating such a regular expression is challenging. Instead, we could simply use the regular expression PC and add the -v option to our grep command. That is,

egrep -v 'PC' offices.txt

This command looks for every line that matches 'PC' and then returns the other lines of the file. Unfortunately, if we have an entry with an office of PCA or APC, that line would not be returned whether they have a PC or not. Therefore, we adjust the regular expression to be ', PC' to avoid matching PC in the building name, so that the egrep command becomes

egrep -v ', PC' offices.txt
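The three attempts can be compared on a few sample lines. In this Python sketch, the invert argument of the illustrative grep helper imitates the -v option. The first two lines come from the examples above, and the third is invented to place PC inside a building name.

```python
import re

offices = [
    "123 MEP, Newman, Mac",
    "444 GH, Fox, PC, Linux",
    "456 PCA, Zappa, Mac",
]

def grep(pattern, lines, invert=False):
    """Keep matching lines, or the non-matching ones when invert is True (-v)."""
    return [line for line in lines if bool(re.search(pattern, line)) != invert]

print(grep(r"[^P][^C]", offices))             # every line matches
print(grep(r"PC", offices, invert=True))      # wrongly drops the PCA office
print(grep(r", PC", offices, invert=True))    # Newman and Zappa, as intended
```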

Other Uses of Regular Expressions

With grep/egrep, we are allowed to use the full range of regular expression metacharacters. But in ls, we are limited to just using the wildcards. What if we wanted to search a directory (or a collection of directories) for certain files that fit a pattern that could be established by regular expression but not by wildcards? For instance, what if you wanted to list all files whose permissions were read-only? Recall from Chapter 6 that permissions are shown when you use the command ls -l. A read-only file would have permissions that contained r-- somewhere in the permissions list (we do not really care if the file is set as rwxrwxr-- or r-------- or some other variant just as long as r-- is somewhere in the permissions).

For ls, * has the meaning of “anything”. So literally we want “anything” followed by r-- followed by “anything”. Would ls -l *r--* accomplish this task for us? No. Let’s see how this instruction would work. First, the Bash interpreter unfolds the notation *r--*. This means that the Bash interpreter obtains all names in the current directory that have anything followed by r followed by two hyphens followed by anything. For instance, foxr--text would match this pattern because of the r-- in the title. Once a list of files is obtained, they would be passed to the ls command, and a long listing would be displayed. Unfortunately, we have done this in the wrong order; we want the long listing first, and then we want to apply *r--*.

Our solution is quite easy. To obtain the long listing first, we do ls -l. We then pipe the result to egrep. Our instruction then becomes ls -l * | egrep 'r--' so that we obtain a long listing of all items in the current directory, and then we pass that listing (a series of lines) to egrep, which searches for any lines with r-- and returns only those. This command will return any line that contains r-- in the long listing. If there is a file called foxr--text, it is returned even if its permissions do not match r--. How can we avoid this? Well, notice that permissions are the first thing in the long listing and the filename is the last thing. We can write a more precise regular expression and include ^ to force it to match at the beginning of the line.

Permissions start with the file type. For this, we do not care what character we obtain, but it should only be a single character. We can obtain any single character using the period. We then expect nine more characters that will be r, w, x, or -. Of these nine characters, we will match on any r--. So we might use a regular expression (r--|[rwx-]{3}). This will match either r-- precisely or any combination of r, w, x, and - over three characters. Unfortunately, this will not work for us because it might match rwx or rw- or even ---. We could instead write this expression as (r--[rwx-]{6}|[rwx-]{3}r--[rwx-]{3}|[rwx-]{6}r--). Here, we require r-- to be seen somewhere in the expression. Now our command is rather more elaborate, but it prevents matches where r-- is found in the filename (or in the username or groupname). Figure 10.5 illustrates the solution; blank spaces are added around the “|” to help with readability.

Figure 10.5

Solution to finding read only files in a directory using ls and egrep.
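The figure is not reproduced in this text. Based on the description above, the command would have roughly the form ls -l | egrep '^.(r--[rwx-]{6}|[rwx-]{3}r--[rwx-]{3}|[rwx-]{6}r--)'. The Python sketch below applies that expression to a few invented lines of long-listing output; the owner, group, sizes, and dates are made up for illustration.

```python
import re

# Invented `ls -l` output lines.
listing = [
    "-rw-r--r-- 1 foxr faculty 1024 Jan 10 09:15 notes.txt",
    "-rwxrwxrwx 1 foxr faculty 2048 Jan 11 10:02 foxr--text",
    "-rwx------ 1 foxr faculty  512 Jan 12 11:30 run.sh",
    "dr--r--r-- 2 foxr faculty 4096 Jan 13 08:00 archive",
]

NAIVE = r"r--"
# Any file type character, then r-- in the owner, group, or world position.
PERMS = r"^.(r--[rwx-]{6}|[rwx-]{3}r--[rwx-]{3}|[rwx-]{6}r--)"

def grep(pattern, lines):
    """Return the lines that contain a match for the pattern."""
    return [line for line in lines if re.search(pattern, line)]

print(len(grep(NAIVE, listing)))   # includes foxr--text because of its name
print(len(grep(PERMS, listing)))   # only entries whose permissions contain r--
```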

You can combine ls -l and egrep to search for a variety of things such as files whose size is greater than 0, files that are owned by a specific user, or files created this year or this date. Can you think of a way to obtain the long listing of files whose size is greater than 0 using ls -l and egrep? This question is asked in this chapter’s review problems (see questions 22 and 23).

There are a variety of other programs that use regular expressions beyond grep. The sed program is a stream editor. This program can be used to edit the contents of a file without having to directly manipulate the file through an editor. For instance, imagine that you want to capitalize the first word of every line of a text file. You could open the file in vi or Emacs and do the editing by hand. Or, you could use the sed program. In sed, you specify a regular expression and a replacement string. The regular expression describes what string you are searching for. The replacement string is used in place of the string found. You can specify a replacement literally, or you can apply commands such as “upper case” or “lower case”. You can remove an item as well.

The sed tool is very useful for making large substitutions quickly to a file. However, it requires a firm grasp of regular expressions. One simple example is to remove all html tags from an html file. One could define a regular expression as '<[^>]*>' and replace it with nothing (or a blank space). The simpler-looking '<.*>' is tempting, but because * matches as much as it can, it would remove everything from the first < on a line to the last >, including the text between two tags. Since anything in < > marks is an html tag, a single sed command could find and remove all of them. Another usage for sed is to reformat a file. Consider a file where information is stored not line by line, but simply with commas to delimit each item. You could replace commas with tab characters (\t) and/or new line characters (\n).
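The two substitutions described above can be tried out with Python's re.sub, which plays the role of a sed search-and-replace in this sketch. The sample strings are invented. The sketch also compares two ways of writing the tag pattern, since * matches as much text as it can.

```python
import re

html = "<p>Hello <b>world</b></p>"

# Each tag is "<", any run of characters other than ">", then ">".
no_tags = re.sub(r"<[^>]*>", "", html)

# With .* the match runs from the first "<" to the last ">" on the line.
greedy = re.sub(r"<.*>", "", html)

# Reformatting: turn a comma-delimited record into one item per line.
one_per_line = re.sub(r",", "\n", "Zappa,Frank,CSC")

print(repr(no_tags))
print(repr(greedy))
print(one_per_line)
```

The first substitution leaves the text Hello world, while the second removes the entire line, text included.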

Another program of note is awk. Its name comes from the surname initials of the three programmers who wrote it: Alfred Aho, Peter Weinberger, and Brian Kernighan. Whereas sed searches for strings to replace, awk searches for strings to process. With awk, you specify pairs of regular expressions and actions. If a line matches a regular expression, that line is processed via the associated action. One simple example of awk’s use is to output specific elements of a line that matches a regular expression. In egrep, when a match is found, the entire line is output, but awk allows you to specify what you want to be output. In this way, awk is somewhat like a database query tool, with the added flexibility that the matching condition can be a regular expression.
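
As a small sketch of a pattern and action pair, the example below prints only the name and size of regular files from a long listing. The sample lines and usernames are invented, and the field positions ($9 for the name, $5 for the size) assume the usual ls -l layout with filenames that contain no spaces.

```shell
# Hypothetical `ls -l` lines standing in for a real directory.
listing='-rw-r--r-- 1 foxr users 120 Jan  5 10:00 notes.txt
drwxr-xr-x 2 foxr users 4096 Jan  7 08:15 docs
-r--r----- 1 zappaf users 64 Jan  6 09:30 readme'

# /^-/ is the regular expression (lines for regular files);
# { print $9, $5 } is the action applied to each matching line.
names_and_sizes=$(printf '%s\n' "$listing" | awk '/^-/ { print $9, $5 }')
printf '%s\n' "$names_and_sizes"
```

Where grep -E '^-' would output each matching line in full, awk outputs only the two requested fields.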

Another example of using awk is to do mathematical operations on the matched items. For instance, imagine that a text file contains payroll information for employees. Among the information are the employees’ names, hours worked this week, wages, and tax information. With awk, we can match all employees who worked overtime and compute the amount of overtime pay that we will have to pay. Or, we might match every entry and compute the average number of hours worked.
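
The payroll idea might be sketched as below. The file layout (name, hours, hourly wage), the 40-hour threshold, and the time-and-a-half overtime rate are all assumptions made for the example, as are the employee records. The overtime condition here is a numeric comparison rather than a regular expression, which awk also accepts as a pattern.

```shell
# Hypothetical payroll records: name, hours worked this week, hourly wage.
payroll='Alice 45 20.00
Bob 38 18.50
Carol 50 22.00'

# Select employees with more than 40 hours and compute overtime pay
# at 1.5 times the hourly wage (an assumed rate).
overtime=$(printf '%s\n' "$payroll" |
    awk '$2 > 40 { printf "%s %.2f\n", $1, ($2 - 40) * $3 * 1.5 }')

# With no pattern, the action runs on every line; END runs after the last one.
# NR holds the number of lines read.
average=$(printf '%s\n' "$payroll" |
    awk '{ total += $2 } END { printf "%.2f\n", total / NR }')

printf '%s\n' "$overtime"
printf 'Average hours: %s\n' "$average"
```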

Regular expressions have been incorporated into both vi and Emacs. When searching for a string, you can specify a literal string, but you can also specify a regular expression. As with sed, this allows you to identify specific strings of interest so that you can edit or format them.

Finally, regular expressions have been incorporated into numerous programming languages. A form of regular expression was first introduced in the language SNOBOL (StriNg Oriented and SymBOlic Language) in the early 1960s. However, it was not until 1987 that regular expressions made a significant appearance in a programming language, and that language was Perl. Perl’s power was primarily centered around defining regular expressions and storing them in variables. Perl was found to be so useful that it became a language that was used to support numerous Internet applications including web server scripting. Since then, regular expressions have been incorporated into newer languages including PHP, Java, JavaScript, the .Net platform, Python, Ruby, and Visual Basic.

We end this chapter with two very bad regular expression jokes.

“If you have a problem, and you think the solution is using regular expressions, then you have two problems.”

Q: What did one regular expression say to the other?

A: .*

Further Reading

Regular expressions are commonly applied in a number of settings whether you are a Linux user, a system administrator, a mathematician, a programmer, or even an end user using vi or Emacs. Books tackle the topic from different perspectives including a theoretical point of view, an applied point of view, and in support of programming. This list contains texts that offer practical uses of regular expressions rather than theoretical/mathematical uses.

  • Friedl, J. Mastering Regular Expressions. Cambridge, MA: O’Reilly, 2006.
  • Goyvaerts, J. and Levithan, S. Regular Expressions Cookbook. Cambridge MA: O’Reilly, 2009.
  • Habibi, M. Java Regular Expressions: Taming the java.util.regex Engine. New York: Apress, 2003.
  • Stubblebine, T. Regular Expressions for Perl, Ruby, PHP, Python, C, Java and .NET. Cambridge, MA: O’Reilly, 2007.
  • Watt, A. Beginning Regular Expressions (Programmer to Programmer). Hoboken, NJ: Wrox, 2005.

Two additional texts are useful if you want to delve more deeply into grep, awk, and sed.

  • Bambenek, J. and Klus, A. Grep Pocket Reference. Massachusetts: O’Reilly, 2009.
  • Dougherty, D. and Robbins, A. sed & awk. Cambridge, MA: O’Reilly, 1997.

Review terms

Terminology introduced in this chapter

  • Enumerated list
  • Escape sequence
  • Filename expansion
  • Globbing
  • Metacharacters
  • Regular expressions
  • String tokenizer
  • White space
  • Wildcard

Review Questions

  1. What is a regular expression?
  2. What does a regular expression convey?
  3. What do you match regular expressions against?
  4. What is the difference between a literal character and a metacharacter in a regular expression?
  5. What is the difference between * and + in a regular expression?
  6. What is the difference between * and ? in a regular expression?
  7. What is the meaning behind a :class: when used in a regular expression? What does the class alnum represent? What does the class punct represent?
  8. Why does the regular expression [0-255] not mean “any number from 0 to 255”?
  9. How does the regular expression . differ from the regular expression [A-Za-z0-9]?
  10. What does the notation {2,3} mean in a regular expression? What does the notation {2,} mean in a regular expression?
  11. What does the escape sequence \d mean? What does the escape sequence \D mean?
  12. What does the escape sequence \w mean? What does the escape sequence \W mean?
  13. Does ^… have the same meaning as [^…] in a regular expression?
  14. How does * differ between ls and grep?
  15. Why does * behave differently in ls than in grep?
  16. Which of the regular expression metacharacters are also used by the Bash interpreter as wildcard characters?
  17. What is the difference between grep and egrep?
  18. Provide some examples of how you might use regular expressions as a Linux user.
  19. Provide some examples of how you might use regular expressions as a Linux system administrator.

Review Problems

  1. Write a regular expression that will match any number of the letters a, b, and c, in any order.
  2. Repeat #1 except that we want the letters to be either upper or lower case.
  3. Repeat #2 except that we want the a’s to be first, the b’s to be second, and the c’s to be last.
  4. Repeat #3 except that we only want to permit between 1 and 3 a’s, 2 and 4 b’s, and any number of c’s.
  5. Repeat #1 except that we want the letters to be either all upper case or all lower case.
  6. Write a regular expression to match the letter A followed by either B or b followed by either c, d, e, or f in lower case, followed by any character, followed by one or more letter G/g (any combination of upper and lower case).
  7. What does the following regular expression match against?
    [a-z]+[0-9]*[a-z]*
  8. What does the following regular expression match against?
    [A-Za-z0-9]+[^0-9]+
  9. What does the following regular expression match against?
    ^[a-z]*$
  10. Write a regular expression that will match a phone number with area code in () as in (859) 572-5334 (note the blank space after the close paren).
  11. Repeat #10 except that the regular expression will also match against a phone number without the area code (it will start with a digit, not a blank space).
  12. Write a Linux ls command that will list all files in the current directory whose name includes two consecutive a’s as in labaa.txt or aa1.c.
  13. Write a Linux ls command that will list all files that contain a digit somewhere in their name.
  14. Write a Linux ls command that will list all files whose name starts with the word file, is followed by a character and ends in .txt, for instance, file1.txt, file2.txt, filea.txt.
  15. Repeat #14 so that file.txt will also be listed, that is, list all files that start with the word file followed by 0 or 1 character, followed by .txt.
  16. Write a grep command to find all words in the file dictionary.dat that have two consecutive vowels.
  17. Repeat #16 to find all words that have a q (or Q) that is not followed by a u.
  18. Repeat #16 to find all five-letter words that start with the letter c (or C).
  19. Repeat #18 to find all five-letter words.
  20. Using a pipe, combine a Linux ls and grep command to obtain a list of files in the current directory whose name includes two consecutive vowels (in any combination, such as ai, eo, uu).
  21. Using a pipe, combine a Linux ls -l and grep command to obtain a list of files in the current directory whose permissions have at least three consecutive hyphens. For instance, it would find a file whose permissions are -rwxr----- or -rwxrwx--- but not -rwxr-xr--.
  22. Using a pipe, combine a Linux ls -l and grep command to obtain a list of files in the current directory whose file size is 0.
  23. Repeat #22 except that the file size should be greater than 0.

Discussion Questions

  1. As a Linux user, provide some examples of how you might apply regular expressions. HINT: consider the examples we saw when combining grep with ls commands. Also consider the use of regular expressions in vi and Emacs searches.
  2. As a system administrator, provide some examples of how you might apply regular expressions.
  3. Explore various programming languages and create a list of the more popular languages in use today, and indicate which of those have capabilities of using regular expressions and which do not.
  4. One of the keys to being a successful system administrator is learning all of the tools that are available to you. In Linux, for instance, mastering vi and learning how to use history and command line editing are very useful. As you responded in #2, using regular expressions is another powerful tool. How do regular expressions compare to vi, history, command line editing with respect to being a successful system administrator? Explain.