Chapter 9. Getting Control—Regular Expression Metacharacters


By the end of this chapter, you will be able to unravel and use the following regular expressions:

die unless (/^.+@[^.].*\.[a-z]{2,}$/);

$money =~ s/(?<=\d)(?=(\d\d\d)+(?!\d))/,/g

9.1 The RegExLib.com Library

Before getting deep into the weeds, let’s take a look at the regexlib.com Web site. This Web site allows you to search for a pattern and will show you a list of regular expression solutions and a rating on how well each one performs its pattern-matching task. Although the Web site may not be 100 percent Perlish in the way it handles regexes, it is certainly a good research tool when you’re trying to get some clues on how to write your own. The following is the opening statement found at the home page of RegExLib.com (also shown in Figure 9.1).

Figure 9.1 The RegExLib.com home page.

Welcome to RegExLib.com, the Internet’s first Regular Expression Library. Currently we have indexed 3800 expressions from 2172 contributors around the world. We hope you’ll find this site useful and come back whenever you need help writing an expression, you’re looking for an expression for a particular task, or are ready to contribute new expressions you’ve just figured out. Thanks!

If you look closely at Figure 9.1, you will see a magnifying glass with a search box next to it. In this box, the word email has been typed. If the search button is clicked, another page will appear with a variety of regular expressions that have been designed by different programmers to match for a valid email address (see Figure 9.2). The purpose of each regex is defined and given a rating (the number of green boxes) on its quality, much like grading a hotel. The more green boxes, the better the regex—five being the best, as in a five-star hotel.

Figure 9.2 Search results.

By the time we finish this chapter, you should be able to read any of the regex examples found here. Once you understand all the metacharacters and how they are used, you can write your own regular expressions or use the ones provided here. Knowing what the regular expression is matching on and being able to test it right at the Web site is a great time-saver. In Figure 9.2, you can see some examples of how to validate an email address. Note the ratings, the test box, and the description.

9.2 Regular Expression Metacharacters

So what are these metacharacters? Regular expression metacharacters are characters that do not represent themselves. They are endowed with special powers to allow you to control the search pattern in some way (for example, find the pattern only at the beginning of the line, or at the end of the line, or only if it starts with an upper- or lowercase letter). Metacharacters lose their special meaning if preceded with a backslash (\). For example, the dot metacharacter represents any single character, but when preceded with a backslash, is just a dot or period.

If you see a backslash preceding a metacharacter, the backslash turns off the meaning of the metacharacter, but if you see a backslash preceding an alphanumeric character in a regular expression, then the backslash means something else; for example, \d means one decimal digit. Perl provides a simpler form of some of the metacharacters, called metasymbols, to represent characters. For example, [0-9] represents digits in the range between 0 and 9, and \d represents the same thing. [0-9] uses the bracket metacharacter; \d is a metasymbol. Table 9.1 describes the metacharacters and what they do.

Explanation

This regular expression contains metacharacters. (See Table 9.1.) The first one is a caret (^). The caret metacharacter matches for a string only if it is at the beginning of the line. The period (.) is used to match for any single character, including whitespace. This expression contains three periods, representing any three characters. To find a literal period or any other character that does not represent itself, the character must be preceded by a backslash to prevent interpretation.

In Example 9.1, the regular expression reads: search at the beginning of the line for an a, followed by any three single characters, followed by a c. What comes after the c could be any characters. It will match, for example, abbbc, a123c, a c, or aAx3cde only if those patterns were found at the beginning of the line.
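The script for Example 9.1 is not reproduced above. As a minimal sketch, assuming the pattern it describes is /^a...c/ and using made-up sample strings, the behavior can be checked like this:

```perl
use strict;
use warnings;

# Hypothetical sample strings; only the first three begin with
# an a, any three characters, and then a c.
my @candidates = ('abbbc', 'a123c', 'aAx3cde', 'xabbbc', 'abc');

# ^ anchors the match to the beginning of the line;
# each dot stands for exactly one character.
my @matched = grep { /^a...c/ } @candidates;

print "$_\n" for @matched;
```

The string xabbbc fails because the a is not at the beginning of the line, and abc fails because there are not three characters between the a and the c.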

Table 9.1 Metacharacters

9.2.1 Metacharacters for Single Characters

If you are searching for a particular character within a regular expression, you can use the dot metacharacter to represent a single character or a character class that matches one character from a set of characters. In addition to the dot and character class, Perl has added some backslashed symbols (called metasymbols) to represent single characters. See Table 9.2.¹

Table 9.2 Metacharacters for Single Characters

1. The metasymbols match on more than just the alphanumeric characters; they are “Unicode” aware.

The Dot Metacharacter

The dot (.) metacharacter matches any single character with the exception of the newline character. For example, the regular expression /a.b/ is matched if the string contains an a, followed by any one single character (except the newline, \n), followed by b, whereas the expression /.../ matches any string containing at least three characters.

Explanation

1. The special DATA filehandle gets its input from the text after the __DATA__ token. The while loop is entered and the first line following the __DATA__ token is read in and assigned to $_. Each time the loop is entered, the next line below __DATA__ is assigned to $_ until all the lines have been processed.

2. The string Found Norma! is printed only if the pattern found in $_ contains an uppercase N, followed by any two single characters, followed by an m and an a. It would find Norma, No man, Normandy, and so forth.
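The script being explained is not shown here. A rough sketch of the same idea, assuming the pattern is /N..ma/ and substituting a made-up list for the lines that would follow __DATA__:

```perl
use strict;
use warnings;

# Made-up input lines standing in for the text after __DATA__.
my @lines = ('Norma Cord', 'No man is an island', 'Normandy', 'Nora Mann');

# N, then any two characters (one per dot), then "ma".
my @found = grep { /N..ma/ } @lines;

print "Found Norma! ($_)\n" for @found;
```

The line Nora Mann is skipped: after N and two more characters comes an a, not the m the pattern requires.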

The s Modifier—The Dot Metacharacter and the Newline

Normally, the dot metacharacter does not match the newline character, \n, because it matches only the characters within a string up until the newline is reached. The s modifier treats the line with embedded newlines as a single line, rather than a group of multiple lines, and allows the dot metacharacter to treat the newline character the same as any other character it might match. The s modifier can be used with both the m (match) and the s (substitution) operators.

Explanation

1. The $_ scalar is assigned; it contains two newlines.

2. The regular expression, /pence./, contains a dot metacharacter. The dot metacharacter does not match a newline character unless the s modifier is used. The $& special scalar holds the value the pattern found in the last successful search; that is, pence\n.

3. The regular expression /rye\../ contains a literal period (the backslash makes the period literal), followed by the dot metacharacter that will match on the newline, thanks to the s modifier. The $& special scalar holds the value the pattern found in the last successful search; that is, rye.\n.

4. The s modifier allows the dot to match on the newline character, \n, found in the search string. The newline will be replaced with a space.
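The script for this explanation is not shown. The following sketch illustrates the same behavior; the two-line string and the replacement text are assumptions chosen to fit the pence and rye fragments mentioned above:

```perl
use strict;
use warnings;

# An assumed two-line string with embedded newlines.
my $text = "Sing a song of sixpence\nA pocket full of rye.\n";

# Without /s the dot cannot match the newline that follows "pence".
my $without_s = ($text =~ /pence./)  ? 1 : 0;

# With /s the dot matches the newline like any other character.
my $with_s    = ($text =~ /pence./s) ? 1 : 0;

# The dot consumes the newline, so the two lines are joined.
(my $joined = $text) =~ s/sixpence.A/sixpence, a/s;

print "without /s: $without_s, with /s: $with_s\n";
print $joined;
```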

The Character Class

A character class represents one character from a set of characters. For example, [abc] matches an a, b, or c, and [a-z] matches one character from a set of characters in the range from a to z, and [0-9] matches one character in the range of digits between 0 and 9. If the character class contains a leading caret (^), then the class represents any one character not in the set; for example, [^a-zA-Z] matches a single character not in the range from a to z or A to Z, and [^0-9] matches a single character not in the range between 0 and 9.² To represent a number between 10 and 13, use 1[0-3], not [10-13].

2. Don’t confuse the caret inside square brackets with the caret used as a beginning of line anchor. See Table 9.7.

Perl provides additional symbols, metasymbols, to represent a character class. The symbols \d and \D represent a single digit and a single nondigit, respectively; they are the same as [0-9] and [^0-9]. Similarly, \w and \W represent a single word character and a single nonword character, respectively; they are the same as [A-Za-z_0-9] and [^A-Za-z_0-9].

The POSIX Bracket Expressions

Perl 5.6 introduced POSIX character classes, a special kind of character class also called bracket expressions. POSIX (the Portable Operating System Interface³) is an industry standard used to ensure that programs are portable across operating systems. In order to be portable, POSIX recognizes that different countries or locales may vary in the way characters are encoded, the symbols used to represent currency, and how times and dates are represented. To handle these different types of characters, the POSIX bracketed character classes are used (see Table 9.3). The POSIX module permits you to access all (or nearly all) the standard POSIX 1003.1 identifiers.

Table 9.3 The Bracketed Character Class

3. POSIX is a registered trademark of the IEEE. See http://www.opengroup.org/austin/papers/backgrounder.html.

The class [:alnum:] is another way of saying A-Za-z0-9. To use this class, it must be enclosed in another set of brackets for it to be recognized as a regular expression. For example, A-Za-z0-9, by itself, is not a regular expression character class, but [A-Za-z0-9] is. Likewise, [:alnum:] should be written [[:alnum:]]. The difference between using the first form, [A-Za-z0-9], and the bracketed form, [[:alnum:]], is that the first form is dependent on ASCII character encoding, whereas the second form allows characters from other languages to be represented in the class. (For more on POSIX expressions, see www.regular-expressions.info/posixbrackets.html.)

To negate one of the characters in the POSIX character class, the syntax is as follows:

[^[:space:]] - all nonwhitespace characters

Explanation

1. Perl 5.6.0 (and above) is needed to use the POSIX character class. (By now, everyone should have a version of Perl higher than 5.6.)

2. The special DATA filehandle gets its input from the text after the __DATA__ token. The while loop is entered and the first line after the __DATA__ token is read in and assigned to $_. Each time the loop is entered, the next line following __DATA__ is assigned to $_ until all the lines have been processed.

3. The regular expression contains POSIX character classes. The line is printed if $_ contains one uppercase letter, [[:upper:]], followed by one or more (+) alphabetic characters, [[:alpha:]], a space, followed by an uppercase letter, and one or more lowercase alphabetic characters, [[:lower:]]. (The + is a regular expression metacharacter representing one or more of the previous characters.) The line joe blow does not match this pattern.
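Since the script itself is missing here, the following is only a sketch of the pattern described in step 3. The regex is reconstructed from the explanation, and the names are made-up stand-ins for the DATA lines:

```perl
use strict;
use warnings;

# Made-up input; a list stands in for the lines after __DATA__.
my @names = ('Steve Blenheim', 'joe blow', 'Betty Boop', 'IGOR chevsky');

# One uppercase letter, one or more alphabetic characters, a space,
# one uppercase letter, then one or more lowercase letters.
my $name_re = qr/[[:upper:]][[:alpha:]]+ [[:upper:]][[:lower:]]+/;

my @proper = grep { $_ =~ $name_re } @names;

print "$_\n" for @proper;
```

Note the doubled brackets: [:upper:] is only the class name, and it must sit inside an outer pair of brackets to be treated as a character class.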

9.2.2 Whitespace Metacharacters

A whitespace character represents a space, tab, return, newline, or form feed. The whitespace character can be represented literally, by pressing a Tab key or the spacebar or the Enter key. See Table 9.4.

Table 9.4 Whitespace Metacharacters

Explanation

1. The special DATA filehandle gets its input from the text after the __DATA__ token. The while loop is entered and the first line after the __DATA__ token is read in and assigned to $_. Each time the loop is entered, the next line following __DATA__ is assigned to $_ until all the lines have been processed.

2. The line $_ is printed if it matches a pattern containing a whitespace character (space, tab, newline), \s. All whitespace characters (other than the newline) are replaced with an *.
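One way to get the effect described in step 2 is sketched below. The input line is invented, and removing the newline with chomp before substituting is an assumption about how the newline was spared:

```perl
use strict;
use warnings;

# A made-up line containing a space, a tab, and a trailing newline.
my $line = "Betty Boop\t200\n";

chomp(my $starred = $line);   # set the trailing newline aside first
$starred =~ s/\s/*/g;         # every remaining space or tab becomes *

print "$starred\n";
```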

9.2.3 Metacharacters to Repeat Pattern Matches

In the previous examples, the metacharacter matched on a single character. What if you want to match on more than one character? For example, let’s say you are looking for all lines containing names, and the first letter must be in uppercase—which can be represented as [A-Z]—but the following letters are lowercase, and the number of letters varies in each name. [a-z] matches on a single lowercase letter. How can you match on one or more lowercase letters? Or zero or more lowercase letters? To do this, you can use what are called quantifiers. To match on one or more lowercase letters, the regular expression can be written /[a-z]+/ where the + sign means “one or more of the previous characters,” which in this case is one or more lowercase letters. Perl provides a number of quantifiers, as shown in Table 9.5.

Table 9.5 The Greedy Metacharacters

The Greed Factor

Normally, quantifiers are greedy; in other words, they match the largest possible set of characters. The search starts at the left-hand side of the string and moves to the right, looking for the last possible character that would satisfy the condition. For example, given the following string:

$_="ab123456783445554437AB"

and the regular expression

s/ab[0-9]*/X/;

the search side would match

ab123456783445554437

All of this will be replaced with an X. After the substitution, $_ would be

XAB

The asterisk (*) is a greedy metacharacter. It matches for zero or more of the preceding characters. In other words, it attaches itself to the character preceding it and looks only for zero or more occurrences of that character. In the preceding example, the asterisk attaches itself to the character class [0-9]. The matching starts on the left, searching for ab followed by zero or more numbers in the range between 0 and 9. The matching continues until the last number is found; in this example, the number 7. The pattern ab and all of the numbers in the range between 0 and 9 are replaced with a single X. The trailing characters, AB, remain.
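The substitution just described can be run as is. This short sketch uses the same string and pattern, and also captures what the greedy pattern matched; the variable names are illustrative only:

```perl
use strict;
use warnings;

my $str = "ab123456783445554437AB";

# Capture what the greedy pattern actually matches.
my ($matched) = $str =~ /(ab[0-9]*)/;

# Replace the same span with a single X, leaving $str itself untouched.
(my $after = $str) =~ s/ab[0-9]*/X/;

print "matched: $matched\n";
print "after:   $after\n";
```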

Greediness can be turned off so that instead of matching on the greatest number of characters, the match is made on the least number of characters found. This is done by appending a question mark after the greedy metacharacter (see Example 9.23).

Explanation

2. The regular expression contains .*, where the * represents zero or more of the previous character. In this example, the previous character is the dot metacharacter, which represents any character at all. This expression reads: find an uppercase letter, [A-Z], followed by zero or more of any character, .*, followed by the letter y. If there is more than one y on the line, the search will include all characters up until the last y. Both Betty and Igor Chevsky are matched. Note that the space in Igor Chevsky is included as one of the characters matched by the dot metacharacter.

Explanation

The expression reads: find three consecutive occurrences of the pattern 5. This does not mean that the string must contain exactly three, and no more, of the number 5. It just means that there must be at least three consecutive occurrences of the number 5. If the string contained 5555555, the match would still be successful. To find exactly three occurrences of the number 5, the pattern would have to be anchored in some way, either by using the ^ and $ anchors or by placing some other character before and after the three occurrences of the number 5; for example, /^5{3}$/ or / 5{3}898/ or /95{3}.56/.
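To see the difference anchoring makes, here is a small sketch that tries /5{3}/ with and without the ^ and $ anchors. The test strings are invented for the illustration:

```perl
use strict;
use warnings;

my @inputs = ('555', '5555555', '55', 'x555y');

my (%unanchored, %anchored);
for my $s (@inputs) {
    # At least three consecutive 5s anywhere in the string.
    $unanchored{$s} = $s =~ /5{3}/   ? 1 : 0;
    # Exactly three 5s and nothing else.
    $anchored{$s}   = $s =~ /^5{3}$/ ? 1 : 0;
    print "$s: unanchored=$unanchored{$s} anchored=$anchored{$s}\n";
}
```

Only 555 satisfies the anchored form; 5555555 and x555y satisfy the unanchored one as well.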

Metacharacters That Turn off Greediness

By placing a question mark after a greedy quantifier, the greed is turned off, and the search ends after the first match rather than the last one. Table 9.6 describes the metacharacters that turn off greediness.

Table 9.6 Turning off Greediness

Explanation

1. The scalar $_ is assigned a string of lowercase letters.

2. The regular expression reads: search for one or more lowercase letters, and replace them with XXX. The + metacharacter is greedy. It takes as many characters as match the expression; meaning, it starts on the left-hand side of the string, grabbing as many lowercase letters as it can find until the end of the string.

3. The value of $_ is printed after the substitution. The whole string has been replaced with XXX.

4. The scalar $_ is again assigned a string of lowercase letters.

5. The regular expression reads: search for one or more lowercase letters, and, after finding the first one, stop searching and replace it with XXX. The ? affixed to the + turns off the greediness of the metacharacter. The minimal number of characters is searched for. The a is replaced with XXX and the rest of the string remains untouched.

6. The value of $_ is printed after the substitution.
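The script for these steps is not shown. A comparable sketch follows, assuming the string is the lowercase alphabet and using named scalars instead of $_:

```perl
use strict;
use warnings;

# Assumed input: a string of lowercase letters.
my $letters = "abcdefghijklmnopqrstuvwxyz";

(my $greedy = $letters) =~ s/[a-z]+/XXX/;    # + grabs the whole string
(my $lazy   = $letters) =~ s/[a-z]+?/XXX/;   # +? stops after one letter

print "$greedy\n";
print "$lazy\n";
```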

EXAMPLE 9.24

(The Script)
   # A greedy quantifier
1  $string="I got a cupful of sugar and two cups of flour " .
           "from the cupboard.";

2  $string =~ s/cup.*/tablespoon/;
3  print "$string\n";
   # Turning off greed
4  $string="I got a cupful of sugar and two cups of flour " .
           "from the cupboard.";
5  $string =~ s/cup.*?/tablespoon/;
6  print "$string\n";

(Output)
3  I got a tablespoon
6  I got a tablespoonful of sugar and two cups of flour from the cupboard.

Explanation

1. The scalar $string is assigned a string containing the pattern cup three times.

2. The s (substitution) operator searches for the pattern cup followed by zero or more characters; that is, cup and all characters to the end of the line are matched and replaced with the string tablespoon. The .* is called a greedy quantifier because it matches for the largest possible pattern.

3. The output shows the result of a greedy substitution.

4. The scalar $string is reset.

5. This time the search is not greedy. By appending a question mark to the .*, the smallest pattern that matches cup, followed by zero or more characters, is replaced with tablespoon. Only the first cup will be replaced with tablespoon resulting in tablespoonful. This example is to demonstrate the way greedy metacharacters work. Another way to write the regex would be: s/cupful/tablespoonful/

6. The new string is printed.

Anchoring Metacharacters

Often, it is necessary to anchor a metacharacter so that it matches only if the pattern is found at the beginning or end of a line, word, or string. These metacharacters are based on a position just to the left or to the right of the character that is being matched. Anchors (see Table 9.7) are technically called zero-width assertions because they correspond to positions, not actual characters in a string. For example, /^abc/ means: find abc at the beginning of the line, where the ^ represents a position, not an actual character.

Table 9.7 Anchors (Assertions)

Explanation

The regular expression contains the caret (^) metacharacter, representing the beginning-of-line anchor only when it is the first character in the pattern. The expression reads: find a J or K at the beginning of the line. \A would produce the same result as the caret in this example. The expression /^[^JK]/ reads: search for a non-J or non-K character at the beginning of the line. Remember that when the caret is within a character class, it negates the character class. It is a beginning-of-line anchor only when positioned directly after the opening delimiter.
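The two roles of the caret can be compared side by side. In this sketch the patterns come from the explanation above, while the list of names is made up:

```perl
use strict;
use warnings;

my @names = ('Jon DeLoach', 'Karen Evich', 'Betty Boop', 'Igor K. Chevsky');

# Caret as an anchor: the line must begin with J or K.
my @starts_jk = grep { /^[JK]/ } @names;

# Second caret negates the class: the line must begin with
# any character other than J or K.
my @starts_other = grep { /^[^JK]/ } @names;

print "J or K first: @starts_jk\n";
print "Other first:  @starts_other\n";
```

Igor K. Chevsky lands in the second list because the K is not at the beginning of the line.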

The m Modifier

The m modifier is used to control the behavior of the $ and ^ anchor metacharacters. A string containing newlines will be treated as multiple lines. If the regular expression is anchored with the ^ metacharacter, and that pattern is found at the beginning of any one of the multiple lines, the match is successful. Likewise, if the regular expression is anchored by the $ metacharacter, and the pattern is found at the end of any one of the multiple lines, it too will return a successful match. The m modifier has no effect on \A, \Z, or \z.

EXAMPLE 9.29

(The Script)
   use warnings;
   # Anchors and the m modifier
1  $_="Today is history.\nTomorrow will never be here.\n";
2  print if /^Tomorrow/;    # Embedded newline

3  $_="Today is history.\nTomorrow will never be here.\n";
4  print if /\ATomorrow/;   # Embedded newline

5  $_="Today is history.\nTomorrow will never be here.\n";
6  print if /^Tomorrow/m;

7  $_="Today is history.\nTomorrow will never be here.\n";
8  print if /\ATomorrow/m;

9  $_="Today is history.\nTomorrow will never be here.\n";
10 print if /history\.$/m;

(Output)
6  Today is history.
   Tomorrow will never be here.
10 Today is history.
   Tomorrow will never be here.

Explanation

1. The $_ scalar is assigned a string with embedded newlines.

2. The ^ metacharacter anchors the search to the beginning of the line. Since the line does not begin with Tomorrow, the search fails and nothing is returned.

3. The $_ scalar is assigned a string with embedded newlines.

4. The \A assertion matches only at the beginning of a string, no matter what. Since the string does not begin with Tomorrow, the search fails and nothing is returned.

5. The $_ scalar is assigned a string with embedded newlines.

6. The m modifier treats the string as multiple lines, each line ending with a newline. In this example, the ^ anchor matches at the beginning of any of these multiple lines. The pattern /^Tomorrow/ is found in the second line.

7. The $_ scalar is assigned a string with embedded newlines.

8. The \A assertion matches only at the beginning of a string, no matter how many newlines are embedded, and the m modifier has no effect. Since Tomorrow is not found at the beginning of the string, nothing is matched.

9. The $_ scalar is assigned a string with embedded newlines.

10. The $ metacharacter anchors the search to the end of a line. With the m modifier, embedded newlines create multiple lines. The pattern /history\.$/ is found at the end of the first line. The \Z and \z assertions are not affected by the m modifier, so neither would match here.

Alternation

Alternation allows the regular expression to contain alternative patterns to be matched. For example, the regular expression /John|Karen|Steve/ will match a line containing John or Karen or Steve. If Karen, John, and Steve are all on different lines, all of those lines are matched. Each of the alternative expressions is separated by a vertical bar (pipe symbol) and the expressions can consist of any number of characters, unlike the character class that matches for only one character; for example, /a|b|c/ is the same as [abc], whereas /ab|de/ cannot be represented as [abde]. The pattern /ab|de/ is either ab or de, whereas the class [abde] represents only one character in the set, a, b, d, or e.
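A brief sketch of the distinction, using the patterns from the paragraph above with invented sample data:

```perl
use strict;
use warnings;

my @lines = ('John Smith', 'Karen Evich', 'Steven Tyler', 'Betty Boop');

# Any one of the three alternatives is enough for a line to match.
my @hits = grep { /John|Karen|Steve/ } @lines;
print "$_\n" for @hits;

# [abde] needs only one of the four letters; /ab|de/ needs a whole pair.
my $class_match = 'bad' =~ /[abde]/ ? 1 : 0;
my $alt_match   = 'bad' =~ /ab|de/  ? 1 : 0;
print "class: $class_match, alternation: $alt_match\n";
```

The word bad contains letters from the class but neither the sequence ab nor de, so only the character class matches it.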

Grouping or Clustering

If the regular expression pattern is enclosed in parentheses, a subpattern is created. Then, for example, instead of the greedy metacharacters matching on zero, one, or more of the previous single characters, they can match on the previous subpattern. Alternation can also be controlled if the patterns are enclosed in parentheses. This process of grouping characters together is also called clustering by the Perl wizards.

Remembering or Capturing

If the regular expression pattern is enclosed in parentheses, a subpattern is created. The subpattern is saved in special numbered scalar variables, starting with $1, then $2, and so on. These variables can be used later in the program and will persist until another successful pattern match occurs, at which time they will be cleared. Even if the intention was only to use grouping, as shown in the previous examples, the subpatterns are saved as a side effect.⁴

4. It is possible to prevent a subpattern from being saved.

Explanation

The regular expression contains the pattern Jon enclosed in parentheses. This pattern is captured and stored in a special scalar, $1, so it can be remembered. (The curly braces used here ${1} are not required, but insulate the number 1 from the string that follows it.) If a second pattern is enclosed in parentheses, it will be stored in $2, and so on. The numbers are represented on the replacement side as $1, $2, $3, and so on. The expression reads: find Jon or jon and replace with either Jonathan or jonathan, respectively. The special numbered variables are cleared after the next successful search is performed.
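The script is not reproduced here. A sketch consistent with the explanation follows, assuming the substitution is s/([Jj]on)/${1}athan/ and using made-up names:

```perl
use strict;
use warnings;

my @names = ('Jon DeLoach', 'jon smith', 'Karen Evich');

my @expanded;
for my $name (@names) {
    # ([Jj]on) captures Jon or jon in $1. Writing ${1} keeps the
    # digit visibly separate from the text "athan" that follows.
    (my $full = $name) =~ s/([Jj]on)/${1}athan/;
    push @expanded, $full;
}

print "$_\n" for @expanded;
```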

EXAMPLE 9.37

(The Script)
   use warnings;
   # Reversing subpatterns
   while(<DATA>){
      s/([A-Z][a-z]+)\s([A-Z][a-z]+)/$2, $1/;
                                      # Reverse first and last names
      print;
   }
   __DATA__
   Steve Blenheim
   Betty Boop
   Igor Chevsky
   Norma Cord
   Jon DeLoach
   Karen Evich

(Output)
Blenheim, Steve
Boop, Betty
Chevsky, Igor
Cord, Norma
De, JonLoach     # Whoops!
Evich, Karen

Explanation

This regular expression also contains two patterns enclosed in parentheses. In this example, metacharacters are used in the pattern-matching process. The first pattern reads: find an uppercase letter followed by one or more lowercase letters. A space follows the remembered pattern. The second pattern reads: find an uppercase letter followed by one or more lowercase letters. The patterns are saved in $1 and $2, respectively, and then reversed on the replacement side. Note the problem that arises with the last name, DeLoach. That is because DeLoach contains both uppercase and lowercase letters after the first uppercase letter in the name. To allow for this case, the pattern should be s/([A-Z][a-z]+)\s([A-Z][A-Za-z]+)/$2, $1/.
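The suggested correction can be tried on a few of the names from the example. In this sketch a list stands in for the DATA lines:

```perl
use strict;
use warnings;

my @names = ('Steve Blenheim', 'Jon DeLoach', 'Karen Evich');

my @reversed;
for my $name (@names) {
    # The second group now accepts upper- and lowercase letters
    # after the initial capital, so DeLoach is captured whole.
    (my $r = $name) =~ s/([A-Z][a-z]+)\s([A-Z][A-Za-z]+)/$2, $1/;
    push @reversed, $r;
}

print "$_\n" for @reversed;
```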

Turning off Greed

Greed can be turned off using the question mark (?) character.

EXAMPLE 9.42

(The Script)
   use warnings;
   # Capturing and greed
1  my $fruit="apples pears peaches plums";
2  $fruit =~ /(.*)\s(.*)\s(.*)/;
3  print "$1\n";
4  print "$2\n";
5  print "$3\n";
   print "-" x 30, "\n";
6  $fruit="apples pears peaches plums";
7  $fruit =~ /(.*?)\s(.*?)\s(.*?)\s/;   # Turn off greedy quantifier
8  print "$1\n";
9  print "$2\n";
10 print "$3\n";
(Output)
3  apples pears
4  peaches
5  plums
   ------------------------------
8  apples
9  pears
10 peaches

Explanation

1. The scalar $fruit is assigned the string.

2. The string is divided into three remembered substrings, each substring enclosed within parentheses. The .* metacharacter sequence reads zero or more of any character. The * always matches for the largest possible pattern. The largest possible pattern would be the whole string. However, there are two whitespaces outside of the parentheses that must also be matched in the string. What is the largest possible pattern that can be saved in $1 and still leave two spaces in the string? The answer is apples pears.

3. The value of $1 is printed.

4. The first substring was stored in $1. peaches plums is what remains of the original string. What is the largest possible pattern (.*) that can be matched and still have one whitespace remaining? The answer is peaches. peaches will be assigned to $2. The value of $2 is printed.

5. The third substring is printed. plums is all that is left for $3.

6. The scalar $fruit is assigned the string again.

7. This time, a question mark follows the greedy quantifier (*). This means that the pattern saved will be the minimal, rather than the maximal, number of characters found. apples will be the minimal number of characters stored in $1, pears the minimal number in $2, and peaches the minimal number of characters in $3. The \s is required or the minimal number of characters would be zero, since the * means zero or more of the preceding character.

8. The value of $1 is printed.

9. The value of $2 is printed.

10. The value of $3 is printed.

Turning off Capturing

When the only purpose is to use the parentheses for grouping, and you are not interested in saving the subpatterns in $1, $2, or $3, the special ?: metacharacter can be used to suppress the capturing of the subpattern.

Explanation

1. The $_ scalar is assigned a string.

2. The ?: turns off capturing when a pattern is enclosed in parentheses. In this example, alternation is used to search for any of two patterns. If the search is successful, the value of $_ is printed, but whichever pattern is found, it will not be captured and assigned to $1.

3. Without the ?:, the value of $1 would be Tom, since it is the first pattern found. ?: says “Don’t save the pattern when you find it.” Nothing is saved and nothing is printed.
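The script for these steps is missing. The following sketch shows the same effect; the string and the two alternatives are assumptions based on the mention of Tom in step 3:

```perl
use strict;
use warnings;

# Assumed input string.
my $text = "Tom Savage and Dan Savage are brothers.";

my ($with, $without);

# Capturing group: the alternative that matched is saved in $1.
if ($text =~ /(Tom|Dan) Savage/)   { $with    = $1; }

# Non-capturing group: the match succeeds, but $1 stays undefined.
if ($text =~ /(?:Tom|Dan) Savage/) { $without = $1; }

print "with capturing:    ", (defined $with    ? $with    : 'undef'), "\n";
print "without capturing: ", (defined $without ? $without : 'undef'), "\n";
```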

Metacharacters That Look Ahead and Behind

Suppose you want to find and replace words in a document that are followed by a comma. In your search string, you have the word you are looking for followed by the comma as part of the search criteria, but you want to exclude the comma when replacing the word. Looking ahead for a pattern that will be matched and then excluded, in this case the comma, is called a positive look ahead. A negative look ahead would look ahead for a character that is not there.

A positive look ahead is an assertion like the ^ and $ anchors in that it represents a position in the search. A regular expression contains the positive look ahead as /regex (?=pattern)/. So for example, if you say s/John (?=Doe)/Jane/, the regex engine will search for John and look ahead to see if Doe follows, and if it does, then the positive look ahead match is true and Doe is completely discarded (and will not be captured in $1). Doe will not be included in what is replaced. Only John will be replaced with Jane.

A negative look ahead looks ahead to see if the pattern (?!pattern) is not there, and if it is not, succeeds, discarding the pattern after the ?!.

With a positive look behind, Perl looks backward in the string for a pattern (?<=pattern) and if that pattern is found, will then continue pattern matching on the regular expression, discarding the pattern in parentheses. A negative look behind looks behind in the string to see if a pattern (?<!pattern) is not there, and if it is not, succeeds in the matching. See Table 9.8.
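As a short illustration, the sketch below applies the substitution from the chapter opening, which combines a positive look behind, a positive look ahead, and a negative look ahead to insert commas into a number. The two shorter substitutions after it use invented strings to show a negative look behind and a negative look ahead on their own:

```perl
use strict;
use warnings;

# Insert a comma at each position that has a digit behind it and
# one or more complete groups of three digits, and no further digit, ahead.
my $money = "1234567";
$money =~ s/(?<=\d)(?=(\d\d\d)+(?!\d))/,/g;
print "$money\n";

# Negative look behind: replace happy only when un does not precede it.
(my $mood = "unhappy happy") =~ s/(?<!un)happy/glad/g;
print "$mood\n";

# Negative look ahead: replace Tom only when my does not follow it.
(my $names = "Tommy and Tom") =~ s/Tom(?!my)/Thomas/g;
print "$names\n";
```

In each case the look around part only tests a position; none of the text it inspects is consumed or replaced.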

Table 9.8 Look Around Assertions

EXAMPLE 9.44

(The Script)
   use warnings;
   # A positive look ahead
1  my $string="I love chocolate cake and chocolate ice cream.";
2  $string =~ s/chocolate (?=ice)/vanilla /;
3  print "$string\n";
4  $string="Tomorrow night Tom Savage and Tommy Johnson will leave " .
           "for vacation.";
5  $string =~ s/Tom(?=my)/Jere/g;
6  print "$string\n";

(Output)
3  I love chocolate cake and vanilla ice cream.
6  Tomorrow night Tom Savage and Jeremy Johnson will leave for vacation.

Explanation

1. The scalar $string contains chocolate twice; the word cake follows the first occurrence of chocolate, and the word ice follows the second occurrence.

2. This is an example of a positive look ahead. The pattern chocolate is followed by (?=ice); meaning, if chocolate is found, look ahead (?=) and see if ice is the next pattern. If ice immediately follows chocolate, the match is successful and chocolate will be replaced with vanilla. The look ahead part, ice, is discarded. It is not part of the pattern to be replaced, but is only there to help further define which chocolate we are looking for.

3. After the substitution on line 2, the new string is printed.

4. The scalar $string is assigned a string of text consisting of three words starting with Tom.

5. The pattern is matched if it contains Tom, only if Tom is followed by my. If the positive look ahead is successful, then Tom will be replaced with Jere in the string.

6. After the substitution on line 5, the new string is printed. Tommy has been replaced with Jeremy.

EXAMPLE 9.46

(The Script)
   use warnings;
   # A positive look behind
1  my $string="I love chocolate cake, chocolate milk, " .
              "and chocolate ice cream.";
2  $string =~ s/(?<= chocolate) milk/ candy bars/;
3  print "$string\n";

4  $string="I love coffee, I love tea, I love the boys " .
           "and the boys love me.";
5  $string =~ s/(?<=the boys) love/ don't like/;
6  print "$string\n";

(Output)
3  I love chocolate cake, chocolate candy bars, and chocolate ice cream.
6  I love coffee, I love tea, I love the boys and the boys don't like me.

Explanation

1. The scalar $string is assigned a string with three different occurrences of chocolate.

2. The pattern in parentheses is called a positive look behind, meaning that Perl looks backward in the string to make sure this pattern occurs. If the pattern milk is found, Perl will look back in the string to see if it is preceded by chocolate and, if so, milk will be replaced with candy bars. The look behind pattern, chocolate, is not affected by the replacement, so now we have chocolate candy bars.

3. The string is printed after the substitution.

5. This is another example of a positive look behind. Perl looks backward in the string for the pattern the boys, and if that pattern immediately precedes love, the regular expression love will be replaced with don't like.
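
Look behind and look ahead can be combined in one pattern, as in the second expression quoted at the start of this chapter. The sketch below wraps that substitution in a helper called add_commas (a hypothetical name, not from the text) and assumes the input is a whole number; a number with a long fractional part would need extra handling.

```perl
use strict;
use warnings;

# Hypothetical helper: insert commas between groups of three digits.
sub add_commas {
    my ($number) = @_;
    # (?<=\d)        a digit must come just before this position
    # (?=(\d\d\d)+   one or more groups of three digits must follow,
    # (?!\d))        and those groups must end where the digits end
    $number =~ s/(?<=\d)(?=(\d\d\d)+(?!\d))/,/g;
    return $number;
}

for my $money (qw(999 1000 1234567)) {
    print "$money becomes ", add_commas($money), "\n";
}
```

Because every part of the pattern is a look around assertion, the match itself is empty; the substitution inserts a comma at each qualifying position without removing any digits.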

9.2.4 The tr or y Operators

The tr operator translates characters on a one-to-one basis. To see what this means, let’s compare translation to substitution. You can see in the following example that the syntax for both the tr operator and substitution operator look very much the same, but they are really quite different in what they do. Let’s take a look at substitution first:

$str = "Elizabeth likes little baby lizards.";
$str =~ s/Elizabeth/Christopher/;
print "$str\n";

and the result is:

Christopher likes little baby lizards.

Now let’s look at the tr function.

$str = "Elizabeth likes little baby lizards.";
$str =~ tr/Elizabeth/Christopher/;
print "$str\n";

and the result is:

Christoph hrkos hrppho tsty hrisrds.

What is different? The s operator searches for a pattern and replaces it with a string; meaning, Elizabeth is replaced with Christopher. The tr operator⁵ translates characters on a one-to-one basis, from each character in the search string to the character in the same position in the replacement string, and returns the number of characters it replaced. In the preceding example, every E in $str is translated to a C, every l is translated to an h, every i is translated to an r, and so on. The search string has only nine characters, so the final e and r of Christopher are never used, which is why the output begins with Christoph.

5. The Perl tr function is derived from the UNIX tr command.

The tr operator does not interpret regular expression metacharacters but allows a dash to represent a range of characters. The letter y can be used in place of tr. This strangeness comes from UNIX, where the sed utility has a y command to translate characters, similar to the UNIX tr. If you look at the UNIX tr man page, you can see that it is very similar to the Perl tr function, illustrating the role UNIX has played in the development of Perl.
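
The return value and the dash notation can be seen in a short sketch. The variable names here are illustrative only. When the replacement string is left empty, tr leaves the string unchanged and simply counts the characters that matched the search string.

```perl
use strict;
use warnings;

my $str = "Elizabeth likes little baby lizards.";

# Empty replacement string: nothing is changed, but tr still
# returns how many characters were found in the search string.
my $lowercase = ($str =~ tr/a-z//);
my $letter_l  = ($str =~ tr/l//);

# A range on each side maps characters position by position.
(my $shout = $str) =~ tr/a-z/A-Z/;

print "Lowercase letters: $lowercase\n";   # 30
print "Occurrences of l:  $letter_l\n";    # 5
print "$shout\n";                          # ELIZABETH LIKES LITTLE BABY LIZARDS.
```

Copying the string inside parentheses before binding it to tr, as in the $shout line, keeps the original string intact.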

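The d option deletes characters in the search string that have no corresponding character in the replacement string.

The c option complements the search string; that is, the translation applies to every character not listed in it.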

The s option is called the squeeze option. Multiple occurrences of characters found in the search string are replaced by a single occurrence of that character (for example, you may want to replace multiple tabs with single tabs). See Table 9.9 for a list of modifiers.

Table 9.9 tr Modifiers

EXAMPLE 9.48

(The Input Data)
   Steve Blenheim 101
   Betty Boop 201
   Igor Chevsky 301
   Norma Cord 401
   Jon DeLoach 501
   Karen Evich 601

(Lines from a Script)
1  tr/a-z/A-Z/;print;

(Output)
STEVE BLENHEIM 101
BETTY BOOP 201
IGOR CHEVSKY 301
NORMA CORD 401
JON DELOACH 501
KAREN EVICH 601

2  tr/0-9/:/; print;

(Output)
Steve Blenheim :::
Betty Boop :::
Igor Chevsky :::
Norma Cord :::
Jon DeLoach :::
Karen Evich :::

3  tr/A-Z/a-c/;print;

(Output)
cteve blenheim 101
betty boop 201
cgor chevsky 301
corma cord 401
con cecoach 501
caren cvich 601

4  tr/ /#/; print;

(Output)
Steve#Blenheim#101
Betty#Boop#201
Igor#Chevsky#301
Norma#Cord#401
Jon#DeLoach#501
Karen#Evich#601

5  y/A-Z/a-z/;print;

(Output)
steve blenheim 101
betty boop 201
igor chevsky 301
norma cord 401
jon deloach 501
karen evich 601

Explanation

1. The tr operator makes a one-to-one correspondence between each character in the search string and each character in the replacement string. Each lowercase letter will be translated to its corresponding uppercase letter.

2. Each number will be translated to a colon.

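3. The translation is messy here. Since the search side represents more characters than the replacement side, the last character on the replacement side is reused for the leftover search characters: A, B, and C are translated to a, b, and c, and all letters from D to Z will be replaced with a c.

4. Each space will be replaced with a pound sign (#).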

5. The y is a synonym for tr. Each uppercase letter is translated to its corresponding lowercase letter.

The d Delete Option

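The d (delete) option deletes every character in the search string that has no corresponding character in the replacement string. If the replacement string is empty, every character in the search string is deleted.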
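
A minimal sketch of both cases follows; the strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# With an empty replacement string, d deletes every character
# in the search string: parentheses, space, and dash.
(my $digits = "(408) 465-1234") =~ tr/() -//d;

# With a shorter replacement string, a and b are translated,
# and c, which has no counterpart, is deleted.
(my $letters = "aabbcc") =~ tr/abc/AB/d;

print "$digits\n";    # 4084651234
print "$letters\n";   # AABB
```

The dash is placed last in the search string so that tr treats it as a literal dash rather than as part of a range.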

The c Complement Option

The c (complement) option complements the search string; that is, it translates each character not listed in this string to its corresponding character in the replacement string.
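
The following sketch uses a shortened, made-up record to show the c option on its own and combined with d. The variable names are illustrative.

```perl
use strict;
use warnings;

my $record = "Tom:408-724";

# Count every character that is NOT a digit; the string is unchanged.
my $non_digits = ($record =~ tr/0-9//c);

# Translate every character that is not a digit to an asterisk.
(my $masked = $record) =~ tr/0-9/*/c;

# Combine c with d: delete everything that is not a digit.
(my $digits = $record) =~ tr/0-9//cd;

print "$non_digits\n";   # 5
print "$masked\n";       # ****408*724
print "$digits\n";       # 408724
```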

The s Squeeze Option

The s (squeeze) option reduces each run of characters that translate to the same character to a single occurrence of that character. It can be used to get rid of excess characters, such as excess whitespace or delimiters, squeezing these characters down to just one.
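
A short sketch of squeezing, using made-up strings:

```perl
use strict;
use warnings;

# Squeeze runs of spaces and runs of colons down to one of each.
(my $record = "Jon   DeLoach::::408-253-3122") =~ tr/ ://s;

# Translate tabs to spaces and squeeze the result in one step.
(my $tidy = "a \t\t b") =~ tr/ \t/ /s;

print "$record\n";   # Jon DeLoach:408-253-3122
print "$tidy\n";     # a b
```

In the second statement the space and the tab both translate to a space, so the whole mixed run collapses to a single space.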

9.3 Unicode

For every character, Unicode specifies a unique identification number called a code point that remains consistent across applications, languages, and platforms.

With the advent of the Internet, it became obvious that the ASCII coding for characters was insufficient if the whole world were to be included in transferring data from one Web site to another without corrupting the data. ASCII defines only 128 characters, and even its one-byte extensions are limited to 256, which could hardly accommodate languages like Chinese and Japanese, where a given symbol is drawn from a set of thousands of characters.

The Unicode standard is an effort to solve the problem by defining a single character set in which each character has a unique number, together with encodings called UTF-8 and UTF-16, where characters are not limited to one byte. UTF-8, for example, stores each character in one to four bytes, and UTF-16 stores each character in one or two 16-bit units. To remove ambiguity, any given code point always represents the same character, thereby allowing for consistent sorting, searching, displaying, and editing of text. According to the Unicode Consortium,⁶ Unicode has the capacity to encode over one million characters, which is sufficient to encompass all the world’s written languages. Further, all symbols are treated equally, so that all characters can be accessed without the need for escape sequences or control codes.

6. The Unicode Consortium is a nonprofit organization founded to develop, extend, and promote use of the Unicode standard. For more information on Unicode and the Unicode Consortium, go to www.unicode.org/unicode/standard/whatisunicode.html.
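
To see that one code point can occupy a different number of bytes in different encodings, here is a minimal sketch using the standard Encode module. The helper byte_lengths is a hypothetical name chosen for this illustration, and the four code points are arbitrary samples.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: how many bytes does one code point
# occupy in UTF-8 and in UTF-16 (big endian, no byte order mark)?
sub byte_lengths {
    my ($code_point) = @_;
    my $char = chr($code_point);
    return ( length( encode('UTF-8',    $char) ),
             length( encode('UTF-16BE', $char) ) );
}

for my $code_point (0x41, 0xE5, 0x263A, 0x1F600) {
    my ($utf8, $utf16) = byte_lengths($code_point);
    printf "U+%04X  UTF-8: %d  UTF-16: %d\n", $code_point, $utf8, $utf16;
}
```

The letter A (U+0041) takes one byte in UTF-8, while U+1F600 takes four bytes in both encodings; the code point itself never changes.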

9.3.1 Perl and Unicode

“The days of just flinging strings around are over. It’s well established that modern programs need to be capable of communicating funny accented letters, and things like euro symbols. This means that programmers need new habits. It’s easy to program Unicode capable software, but it does require discipline to do it right.”

— Perlunitut

The largest change in Perl 5.6 was to provide UTF-8 Unicode support. By default, Perl represents strings internally in Unicode, and all the relevant built-in functions (length, reverse, sort, tr) now work on a character-by-character basis instead of on a byte-by-byte basis. Two Perl pragmas affect how Unicode is handled. The utf8 pragma tells Perl that the source code of the script is itself written in UTF-8, so that Unicode characters can appear literally in strings and patterns, while the bytes pragma refers to the old byte meanings, reading one byte at a time. (For a complete discussion, see perldoc.perl.org/perlunicode.html.)

To list the character encodings known to your version of Perl, type at the prompt:

$ perl -MEncode -le "print for encodings(':all')"
ascii
ascii-ctrl
iso-8859-1
null
utf-8-strict
utf8
(This output is for Perl 5.16.)

You can specify Unicode characters in string literals using the \x{Number} notation, which is required for characters (called code points) above 0xFF (see www.unicode-table.com), where Number is a hexadecimal character code, as in \x{395}. See Figure 9.3.

Figure 9.3 The unicode-table.com Web site.

You can also use the \N{U+hexnumber} notation where hexnumber in the braces is the hexadecimal number for the Unicode character; for example, a smiley face is \N{U+263A}, or use the official name for the Unicode character, \N{WHITE SMILING FACE}. For a list of Unicode character names, see www.unicode.org/charts/charindex.html.

EXAMPLE 9.52

1  use 5.012;  # use feature 'unicode_strings'
2  my $smiley="\N{U+263A}";  # Unicode smiley character
3  utf8::encode($smiley);
4  print "Smiley face is $smiley\n";
5  my $swring="\x{00E5}";  # Unicode for Swedish ring, Decimal 229
6  utf8::encode($swring);
7  print "Swedish ringed a is $swring\n";
   my $symbol = "\N{UMBRELLA}";  # Name the code point
8  utf8::encode($symbol);
9  print "Umbrella is $symbol\n";

(Output)
Smiley face is ☺
Swedish ringed a is å
Umbrella is ☂

Explanation

1. From perldoc: In order to preserve backward compatibility, Perl does not turn on full internal Unicode support unless the pragma use feature 'unicode_strings' is specified (automatically selected if you use use 5.012 or higher).

This example is using 5.016, making this line not necessary.

2. Using the \N notation, the U+ prefix marks the hexadecimal number that follows as a Unicode code point, and the code point for a smiley character is 263A. Instead of the number, you could use the name: \N{WHITE SMILING FACE}.

3. The utf8::encode function converts the smiley face, in place, from a single Unicode character into the sequence of UTF-8 bytes that represents it, so that it can be printed.

4. The Unicode smiley face is printed after encoding.

5. You can use the \x{...} notation for any code point; the braces are required for characters 0x100 and above. This time, the notation for the Swedish ringed a is \x{00E5}.

7. With the \N{...} notation, you can put the official Unicode character name within the braces; in this example, the name for an umbrella symbol.

8. The encode function changes the characters of a Perl scalar to UTF-8 bytes. See http://perldoc.perl.org/5.8.9/utf8.html.
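
Encoding each string by hand is one approach. Another common approach, shown here as a sketch rather than as the method used in the example, is to attach an encoding layer to STDOUT with binmode so that character strings are encoded as they are printed. The variable $bytes is added only to compare lengths.

```perl
use strict;
use warnings;

# Encode everything printed to STDOUT as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');

my $smiley = "\N{U+263A}";
my $swring = "\x{00E5}";
print "Smiley face is $smiley\n";
print "Swedish ringed a is $swring\n";

# For comparison: utf8::encode turns the one-character string
# into the three UTF-8 bytes that represent it.
my $bytes = $smiley;
utf8::encode($bytes);
print "Characters: ", length($smiley), ", bytes: ", length($bytes), "\n";
```

With the layer in place, the strings stay as characters inside the program, so functions such as length keep counting characters instead of bytes.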

Unicode also provides support for regular expressions and matching characters based on Unicode properties, some of which are defined by the Unicode standard and some by Perl. The Perl properties are composites of the standard properties; in other words, you can now match any uppercase character in any language with \p{IsUpper}.

Table 9.10 is a list of Perl’s composite character classes. If the p in \p is capitalized, the meaning is a negation; so, for example, \p{IsASCII} represents an ASCII character, whereas \P{IsASCII} represents a non-ASCII character.

Table 9.10 utf8 Composite Character Classes

Explanation

1. The utf8 pragma tells Perl that the source code of the script is written in UTF-8. Even in modern Perl, utf-8 is not a default.

2. Scalar $chr is assigned a number.

3. The Perl Unicode property IsDigit is used to check for a digit. For ASCII text this is the same as using [0-9], although \p{IsDigit} also matches decimal digits from other scripts.

4. Scalar $chr is assigned the string junk.

5. The \p is now \P, causing the escape sequence to mean not a digit, the same as using [^0-9]. Since junk is not a digit, the condition is true.

6. Since junk contains no control characters, a test for a character that is not a control character is also true.
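
The script these notes describe is not reproduced here. The following is a minimal sketch consistent with them, not the original listing; the flag variables and the Arabic-Indic digit at the end are additions for illustration.

```perl
use strict;
use warnings;
use utf8;    # the source code of this script is UTF-8

my $chr = 11;
my $is_digit = $chr =~ /\p{IsDigit}/ ? 1 : 0;
print "$chr is a digit.\n" if $is_digit;

$chr = "junk";
my $not_digit   = $chr =~ /\P{IsDigit}/ ? 1 : 0;
my $not_control = $chr =~ /\P{IsCntrl}/ ? 1 : 0;
print "$chr is not a digit.\n"             if $not_digit;
print "$chr is not a control character.\n" if $not_control;

# A digit from another script: ARABIC-INDIC DIGIT THREE.
my $arabic_three     = "\x{0663}";
my $matches_property = $arabic_three =~ /\p{IsDigit}/ ? 1 : 0;
my $matches_range    = $arabic_three =~ /[0-9]/       ? 1 : 0;
print "Property: $matches_property, range: $matches_range\n";   # Property: 1, range: 0
```

The last two tests show where the property and the bracketed range part ways: the property follows the Unicode definition of a decimal digit, while [0-9] lists ten specific characters.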

9.4 What You Should Know

1. What are metacharacters used for?

2. What is a character class?

3. What is meant by a “greedy” metacharacter?

4. What is an anchoring metacharacter?

5. How do you search for a literal period?

6. What is capturing? Can you turn it off?

7. What is grouping?

8. How does a character class differ from alternation?

9. How do you search for one or more digits?

10. How do you search for zero or one digit?

11. What is a metasymbol?

12. What is the purpose of the “squeeze” option when used with tr?

13. What is utf8?

9.5 What’s Next?

In the next chapter, we discuss how Perl deals with files, how to open them, read from them, write to them, append to them, and close them. You will learn how die works. You will learn how to seek to a position within a file, how to rewind back to the top, how to mark a spot for the next read operation. You will learn how to perform file tests to see if a file is readable, writeable, executable, and so forth. We will also discuss pipes, how Perl sends output to a pipe, and how Perl reads from a pipe. You will learn how to pass arguments to a Perl script at the command line and all the variations of ARGV.

Exercise 9: And the Search Goes On . . .

(Sample file found on CD)

Tommy Savage:408-724-0140:1222 Oxbow Court, Sunnyvale,CA 94087:5/19/66:34200
Lesle Kerstin:408-456-1234:4 Harvard Square, Boston, MA 02133:4/22/62:52600
Jon DeLoach:408-253-3122:123 Park St., San Jose, CA 94086:7/25/53:85100
Ephram Hardy:293-259-5395:235 Carlton Lane, Joliet, IL 73858:8/12/20:56700
Betty Boop:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23:14500
Wilhelm Kopf:846-836-2837:6937 Ware Road, Milton, PA 93756:9/21/46:43500
Norma Corder:397-857-2735:74 Pine Street, Dearborn, MI 23874:3/28/45:245700
James Ikeda:834-938-8376:23445 Aster Ave., Allentown, NJ 83745:12/1/38:45000
Lori Gortz:327-832-5728:3465 Mirlo Street, Peabody, MA 34756:10/2/65:35200
Barbara Kerz:385-573-8326:832 Ponce Drive, Gary, IN 83756:12/15/46:268500

1. Print the city and state where Norma lives.

2. Give everyone a $250.00 raise.

3. Calculate Lori’s age.

4. Print lines 2 through 6. (The $. variable holds the current line number.)

5. Print names and phone numbers of those in the 408 area code.

6. Print names and salaries in lines 3, 4, and 5.

7. Print a row of stars after line 3.

8. Change CA to California.

9. Print the file with a row of stars after the last line.

10. Print the names of the people born in March.

11. Print all lines that don’t contain Karen.

12. Print lines that end in exactly five digits; no more, no less.

13. Print the file with the first and last names reversed with only the first letter of the first name and the full last name; for example, Savage,

14. Print all cities in California, and the first names of those people who live there.

15. Without using the split function, print all the lines up to the first colon (just the names).

16. Without using the split function, print the street address; for example, 123 Park St.

17. Create and display a new format for all the phone numbers to look like this:

(408) 465-1234

18. Print a smiley face, a heart, and a black chess knight after line 6.
