Chapter 8. Regular Expressions—Pattern Matching

By the end of this chapter, you should understand the following Perl code:

while(<DATA>){
   ($name,$price)=split(":");
   $price =~ s/$price/$& + ($& * .20)/e;
   printf "%s \$%.2f\n",$name, $price if $price > 35;
}
__DATA__
Ramya:30.25
Lu:12.66
Shiva:65.75
Dee:44.32

8.1 What Is a Regular Expression?

This is a regular expression: /love/. This is also a regular expression: /^.+$/.

A regular expression (regex) is really just a sequence, or pattern, of characters that is matched against a string of text when performing searches and replacements. Regexes have been around for a long time and most modern programming languages have libraries that support them, such as the popular library called PCRE, short for Perl Compatible Regular Expressions, an open-source library compatible with a great number of C compilers and operating systems. If you are familiar with UNIX utilities, such as vi, sed, grep, and awk, you have already met regular expressions face-to-face. Under Perl, regular expressions have evolved into a powerful tool, unsurpassed by any modern programming language and a major reason for Perl’s rise to fame.

A simple regular expression consists of a character or set of characters that matches itself. The regular expression is normally delimited by forward slashes (see note 1). The special scalar $_ is the default search space where Perl does its pattern matching. $_ is like a shadow. Sometimes you see it; sometimes you don’t. Don’t worry; all this will become clear as you read through this chapter.

1. Actually, you can use almost any non-alphanumeric, non-whitespace character as a delimiter. See Table 8.1 and Example 8.12.
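
To make the idea of the default search space concrete, here is a minimal sketch. The sample strings are hypothetical; the point is only that /love/ is matched against $_ without $_ ever being named in the match.

```perl
use strict;
use warnings;

# Hypothetical sample lines; any list of strings would do.
my @lines = ("I love Perl", "Gloves are warm", "Nothing here");
my @hits;

for (@lines) {                  # each element is aliased to $_ in turn
    push @hits, $_ if /love/;   # /love/ is matched against $_ implicitly
}

print "$_\n" for @hits;
```

Note that "Gloves are warm" is also selected, because the pattern matches the characters l-o-v-e anywhere in the string, not just the whole word.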

8.1.1 Why Do We Need Regular Expressions?

Before going into the details, you may wonder what regular expressions do. Good examples include validating input from a user, searching for sequences in a file, and replacing one value with another based on specified search criteria. When you fill out a form to buy a product online, a program must be able to validate that the information is correct. Let’s say you type your email address into one of the text boxes in the form. The program running behind the scenes (JavaScript, PHP, Perl, or the like) will check to see if the email address is valid. How would the program perform the test? Enter regular expressions, commonly called regexes or re’s. The following is an example of a regular expression used to validate an email address:

/^\w+[\w-\.]*\@\w+((-\w+)|(\w*))\.[a-z]{2,3}$/

It looks like gibberish at first sight. The only character you might recognize in the example as part of an email address is the @ symbol. Most of the other characters are regular expression metacharacters with special meanings; for example, \w+ represents one or more word characters (a-z, A-Z, 0-9, and the underscore). This regular expression is designed to search for a specific pattern of characters that would be found in an email address. Devising the exact pattern can be very time-consuming, and a given expression may not accept every valid email address, so it takes numerous tests before a pattern can be trusted as a comprehensive validation. You can find various regex libraries on the Web to help you unravel patterns such as the one shown here (see also Chapter 9, “Getting Control—Regular Expression Metacharacters”); not all of them are that complex.
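
The following sketch shows how such a pattern might be applied. It assumes the backslash-escaped form of the expression (with the hyphen inside the brackets also escaped to avoid a warning), and the test addresses are made up for illustration. It is not a complete validator; for instance, it rejects domains whose last part is longer than three letters.

```perl
use strict;
use warnings;

# Assumed form of the pattern, with the metacharacters backslash-escaped.
my $email_re = qr/^\w+[\w\-.]*\@\w+((-\w+)|(\w*))\.[a-z]{2,3}$/;

my %result;
for my $addr ('joe.smith@example.com', 'not-an-address', 'a@b') {
    # Bind each address to the precompiled pattern with =~
    $result{$addr} = ($addr =~ $email_re) ? 'looks valid' : 'rejected';
    print "$addr: $result{$addr}\n";
}
```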

This chapter will show you how to use regular expressions for simple pattern matching and substitution, how to use them with the conditional and looping modifiers, and how to use the various regex options to further define the expression. In Chapter 9, regular expression metacharacters will be explained to let you further control and refine the search pattern as shown in the email example, and that is where the real power of pattern matching lies.

8.2 Modifiers and Simple Statements with Regular Expressions

A simple statement is an expression terminated with a semicolon. As we saw in Chapter 7, “If Only, Unconditionally, Forever,” Perl supports a set of modifiers that let you evaluate a statement based on some condition. A simple statement may contain an expression ending with a single modifier, and the modified statement is always terminated with a semicolon. When evaluating regular expressions, the modifiers may be simpler to use than the full-blown conditional constructs.
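
As a small illustration of modifiers combined with a pattern match, the sketch below uses the if and unless modifiers on simple statements. The names in the list are hypothetical sample data.

```perl
use strict;
use warnings;

my @out;
for ("Steve Blenheim", "Betty Boop", "Igor Chevsky") {   # hypothetical names
    push @out, "found: $_"   if     /Betty/;   # runs only when $_ matches
    push @out, "skipped: $_" unless /Betty/;   # runs only when it does not
}
print "$_\n" for @out;
```

Each statement ends with one modifier and a semicolon; no braces or else branches are needed.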

8.2.1 Pattern Binding Operators

The pattern binding operators bind the string being searched to the pattern that specifies the search. In the previous examples, most of the pattern searches were done implicitly (or explicitly) on the $_ variable, the default pattern space. That is, each line was stored in the $_ variable when looping through a file. We’ve also seen that if you store a value in some variable other than $_, you will need the pattern binding operators (see Table 8.1).

Table 8.1 Pattern Matching Operators

Instead of using $_ as in the following line:

$_ = 5000;

you could use another named scalar, like so:

$salary = 5000;

Then, if a match is performed on $salary instead of $_, you would use this:

$salary =~ /5/; or  $salary !~ /5/;

So, if you have a string that is not stored in the $_ variable and need to perform matches or substitutions on that string, the pattern binding operators =~ or !~ are used. They are also used with the tr function for string translations (for more on tr, see Section 9.2.4, “The tr or y Operators”). This doesn’t mean that you can’t use the pattern binding operators with the $_ variable; it just means that if you’re not using $_, then you will need them.
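
A short sketch of both binding operators follows. The flag variables $has_five and $no_nine are introduced here only for illustration.

```perl
use strict;
use warnings;

my $salary = 5000;
my $has_five = ($salary =~ /5/) ? 1 : 0;   # true: "5000" contains a 5
my $no_nine  = ($salary !~ /9/) ? 1 : 0;   # true: "5000" contains no 9

print "salary contains a 5\n"       if $has_five;
print "salary does not contain 9\n" if $no_nine;

# Without a binding operator, the match is applied to $_ instead.
$_ = "5000";
print "\$_ contains a 5\n" if /5/;
```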

8.2.2 The DATA Filehandle

In the following examples, the special filehandle called DATA is used as an expression in a while loop. This allows us to directly get the data from the same script that is testing it, rather than reading input from a separate text file. (In fact, you may find this technique handy if you are testing some specific sections of an external file. Just copy the lines in question into your script, place them under the __DATA__ special literal, and run your tests within the script.) The data itself is located after the __DATA__ special literal (see note 2) at the bottom of each of the example scripts. The __DATA__ literal marks the logical end of the script and opens the DATA filehandle for reading. Each time a line of input is read from <DATA>, it is assigned by default to the special $_ scalar. Although $_ is implied, you could also use it explicitly, or even some other scalar. The format used is shown in the following examples.

2. Instead of __DATA__, you can use __END__. However, __END__ opens the DATA filehandle only in the main package (the top-level script), whereas __DATA__ opens it in whichever package is in effect where the literal appears.
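
Here is a minimal sketch of reading from the DATA filehandle into a named scalar rather than $_. The two records reuse the name:price layout from the chapter opener; the variable names are illustrative.

```perl
use strict;
use warnings;

my @names;
while (my $line = <DATA>) {        # explicit scalar instead of the implied $_
    chomp $line;
    next unless $line =~ /:/;      # skip anything that is not name:price
    my ($name, $price) = split /:/, $line;
    push @names, $name;
    print "$name costs $price\n";
}
__DATA__
Ramya:30.25
Lu:12.66
```

Because the match is done on $line rather than $_, the =~ binding operator is required in the next unless test.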

8.3 Regular Expression Operators

The regular expression operators are used for matching patterns in searches and for replacements in substitution operations. The m operator is used for matching patterns, and the s operator is used when substituting one pattern for another.

8.3.1 The m Operator and Pattern Matching

The m operator is optional if the delimiters enclosing the regular expression are forward slashes (the forward slash is the default), but required if you change the delimiter. You may want to change the delimiter if the regular expression itself contains forward slashes (for example, when searching for birthdays, such as 3/15/93, or pathnames, such as /usr/var/adm). Matching modifiers are shown in Table 8.2.

Table 8.2 Matching Modifiers
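
The following sketch contrasts the default slash delimiters with alternatives introduced by the m operator. The pathname and date strings are hypothetical.

```perl
use strict;
use warnings;

my $path = "/usr/var/adm/messages";   # hypothetical pathname
my $date = "Born on 3/15/93";         # hypothetical record

# With slashes as delimiters, every / in the pattern must be escaped.
my $escaped = ($path =~ /\/usr\/var\/adm/) ? 1 : 0;

# With m and another delimiter, the slashes need no escaping.
my $hash_delim  = ($path =~ m#/usr/var/adm#) ? 1 : 0;
my $brace_delim = ($date =~ m{3/15/93})      ? 1 : 0;

print "escaped=$escaped hash=$hash_delim brace=$brace_delim\n";
```

All three matches succeed; the second and third forms are simply easier to read.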

The g Modifier—Global Match

The g modifier is used to cause a global match; in other words, all occurrences of a pattern in the line are matched. Without the g, only the first occurrence of a pattern is matched. In list context, the m operator with the g modifier returns a list of everything that was matched.
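
A brief sketch of a global match in list context, using a made-up sentence:

```perl
use strict;
use warnings;

my $text = "I love my lovely dog, Love.";

# List context with g: every matched substring is returned.
my @all      = $text =~ /love/g;
# Adding i also picks up the capitalized "Love".
my @any_case = $text =~ /love/gi;

print scalar(@all), " case-sensitive matches\n";
print "all cases: @any_case\n";
```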

The i Modifier—Case Insensitivity

Perl is sensitive to whether characters are upper- or lowercase when performing matches. If you want to turn off case sensitivity, an i (insensitive) is appended to the last delimiter of the match operator.
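
For instance, the sketch below (with hypothetical names) selects every string containing betty, regardless of case:

```perl
use strict;
use warnings;

my @matched;
for my $name ("Betty Boop", "BETTY DAVIS", "betty white", "Bob Hope") {
    push @matched, $name if $name =~ /betty/i;   # i ignores case in the pattern
}
print "$_\n" for @matched;
```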

Special Scalars for Saving Patterns

The $& special scalar is assigned the string that was matched in the last successful search. $` saves what was found preceding the pattern that was matched, and $' saves what was found after the pattern that was matched.
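
The three special scalars, the prematch $`, the match $&, and the postmatch $', can be seen together in this small sketch. The variables $before, $match, and $after are illustrative copies.

```perl
use strict;
use warnings;

my $str = "I love Perl";
my ($before, $match, $after);

if ($str =~ /love/) {
    # Copy the special scalars right away; the next successful match resets them.
    ($before, $match, $after) = ($`, $&, $');
    print "before=[$before] match=[$match] after=[$after]\n";
}
```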

The x Modifier—The Expressive Modifier

The x modifier allows you to place comments within the regular expression and add whitespace characters (spaces, tabs, newlines) for clarity without having those characters interpreted as part of the regular expression; in other words, you can express your intentions within the regular expression.
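
As a sketch of the x modifier, the two matches below look for the same text in a hypothetical record. In the second form the pattern is spread over several lines with comments, and the space between the names is written as \s because plain whitespace in the pattern is ignored under x.

```perl
use strict;
use warnings;

my $line = "Tommy Savage:408-724-0140";   # hypothetical record

# Compact form: the space in the pattern is significant.
my $compact = ($line =~ /Tommy Savage:408/) ? 1 : 0;

# Expressive form: whitespace and comments inside the pattern are ignored.
my $expressive = ($line =~ /
    Tommy \s Savage    # first name, one whitespace character, last name
    :                  # field separator
    408                # area code
/x) ? 1 : 0;

print "compact=$compact expressive=$expressive\n";
```

One caution: a comment inside the pattern must not contain the delimiter character, or Perl will treat it as the end of the regular expression.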

8.3.2 The s Operator and Substitution

The s operator is used for substitutions. The substitution operator searches for the pattern in the first set of slashes and, if it is found, replaces the matched text with what is found within the second set of forward slashes. The delimiter can also be changed. The g modifier placed after the last delimiter stands for global change on a line, so that if Perl finds multiple occurrences of the pattern on a line, it will replace all of them, not just the first one it finds. The return value from the s operator is the number of substitutions that were made.

The special built-in variable $&, used in the replacement side of the substitution, gets the value of whatever was found in the search string. $& is a read-only variable. It cannot be changed.
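
The sketch below, using a made-up sentence, shows a first-only substitution, a global substitution, the return value of the s operator, and $& on the replacement side. The copy-then-substitute idiom (my $copy = $text) =~ s/.../.../ is used so the original string stays intact for the next step.

```perl
use strict;
use warnings;

my $text = "Tom likes tomatoes; Tom grows them.";

(my $first_only = $text) =~ s/Tom/Bob/;    # replaces the first Tom only
(my $all = $text)        =~ s/Tom/Bob/g;   # replaces every Tom

# The return value is the number of substitutions; $& is the matched text.
my $count = ($text =~ s/Tom/<$&>/g);

print "$first_only\n$all\n$text\n";
print "$count substitutions made\n";
```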

8.3.3 The Pattern Binding Operators with Substitution

You can also use the pattern binding operators with substitution, to bind the string being searched to the pattern that specifies the search. In the previous examples, most of the substitutions were done implicitly (or explicitly) on the $_ variable, the default pattern space. That is, each line was stored in the $_ variable when looping through a file. We’ve also seen that if you store a value in some variable other than $_, you will need the pattern binding operators (see Table 8.3).

Table 8.3 Pattern Matching Operators with Substitution

Instead of

$_ = "John";

we will use a named variable

$name = "John";

Then if a substitution is performed on $name instead of $_, as in

print if s/John/Sam/;

you would write

print if $name =~ s/John/Sam/;

Changing the Substitution Delimiters

Normally, the forward slash delimiter encloses both the search pattern and the replacement string. You can use any non-alphanumeric, non-whitespace character following the s operator in place of the slash. For example, if a # follows the s operator, you must also use it as the delimiter for the replacement pattern. If you use pairs of parentheses, curly braces, square brackets, or angle brackets to delimit the search pattern, you may use any other type of delimiter for the replacement pattern, such as s(John) /Joe/;.
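
The following sketch applies the same substitution to a hypothetical pathname with three different delimiter styles. The variable names are illustrative.

```perl
use strict;
use warnings;

my $path = "/usr/local/bin";   # hypothetical pathname

(my $hash_form  = $path) =~ s#/usr/local#/opt#;     # one delimiter throughout
(my $brace_form = $path) =~ s{/usr/local}{/opt};    # paired braces for both parts
(my $mixed_form = $path) =~ s(/usr/local) !/opt!;   # parentheses, then a different delimiter

print "$hash_form\n$brace_form\n$mixed_form\n";
```

Each form produces the same result; choosing a delimiter that does not occur in the pattern avoids having to escape the slashes.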

Substitution Modifiers

You can control the way the substitution is performed with a number of special modifiers; for example, you can turn off case sensitivity, evaluate the replacement side, make global substitutions, and so forth. Table 8.4 lists those modifiers.

Table 8.4 Substitution Modifiers

The g Modifier—Global Substitution

The g modifier is used to cause a global substitution; that is, all occurrences of a pattern are replaced on the line. Without the g, only the first occurrence of a pattern on each line is changed.

The i Modifier—Case Insensitivity

Perl is sensitive to upper- or lowercase characters when performing matches. To turn off case sensitivity, append an i (insensitive) after the last delimiter of the match or substitution operator. The search pattern then becomes case insensitive; the replacement side is not affected.
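
A short sketch of the g and i modifiers in a substitution, using a made-up string:

```perl
use strict;
use warnings;

my $line = "Tom, TOM and tom";

(my $first = $line) =~ s/tom/Sam/;     # case-sensitive, first match only
(my $every = $line) =~ s/tom/Sam/gi;   # every match, in any case

print "$first\n$every\n";
```

In the first substitution only the lowercase tom at the end is changed. In the second, all three are replaced, and the replacement is always inserted exactly as written.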

The e Modifier—Evaluating an Expression

On the replacement side of a substitution operation, it is possible to evaluate an expression or a function. With the e modifier, the replacement side is evaluated as a Perl expression, and the matched text is replaced with the result of the evaluation.

Using the Special $& Variable in a Substitution

The special $& variable is used to hold the pattern that is found on the search side of a substitution. Its value is used in the replacement side when performing an evaluation, but it is a read-only variable, meaning you cannot change it; for example, you cannot use $& += 5.
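
The sketch below shows the e modifier with $& in an arithmetic expression and in a function call. The price and name values are illustrative, echoing the chapter opener.

```perl
use strict;
use warnings;

my $price = "65.75";
# e evaluates the replacement as Perl code; $& holds the matched text.
$price =~ s/65\.75/$& + ($& * .20)/e;   # 65.75 plus a 20% markup
printf "\$%.2f\n", $price;

my $name = "shiva";
$name =~ s/shiva/uc $&/e;               # replace the match with its uppercase form
print "$name\n";
```

Without the e, the replacement side would be inserted as a string rather than evaluated.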

Pattern Matching with a Real File

In all the previous examples, we have been using the DATA filehandle for performing pattern matches and substitutions with regular expressions. The following examples demonstrate how you can use pattern matching when working with lines from an external file.
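
As a minimal sketch of that idea, the script below opens a file and prints the lines that match a pattern. To keep it self-contained, it first writes two hypothetical records to a temporary file that stands in for the external file; in practice you would open your own file by name.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for an external file: write two hypothetical records to a temp file.
my ($tmp, $filename) = tempfile(UNLINK => 1);
print {$tmp} "Lori Gortz:327-832-5728\nBetty Boop:245-836-8357\n";
close $tmp or die "Cannot close $filename: $!";

open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
my @found;
while (<$fh>) {                   # each line is assigned to $_
    push @found, $_ if /Betty/;   # match against $_ implicitly
}
close $fh;

print @found;
```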

8.4 What You Should Know

1. What is meant by a regular expression? Why do we need them?

2. How are the if and unless modifiers used?

3. How do you change the forward slash delimiter used in the search pattern to something else?

4. What does the s operator do?

5. What is meant by a global search?

6. When do you need the pattern binding operators, =~ and !~?

7. What is the default pattern space holder?

8. What is the DATA filehandle used for?

9. What do the i, e, and g modifiers mean?

8.5 What’s Next?

In the next chapter, you will harness the power of pattern matching by learning Perl’s plethora of regular expression metacharacters. You will learn how to anchor patterns and how to search for alternating patterns, whitespace characters, sets of characters, repeating patterns, and so forth. You will learn about greedy metacharacters and how to control them. You will learn how to capture and group patterns, and how to look ahead and behind. By the time you complete that chapter, you should be able to search for data by regular expressions based on a specific criterion in order to validate the data and to modify the text that was found.

Exercise 8: A Match Made in Heaven

(sample.file found on CD)

Tommy Savage:408-724-0140:1222 Oxbow Court, Sunnyvale,CA 94087:5/19/66:34200
Lesle Kerstin:408-456-1234:4 Harvard Square, Boston, MA 02133:4/22/62:52600
JonDeLoach:408-253-3122:123 Park St., San Jose, CA 94086:7/25/53:85100
Ephram Hardy:293-259-5395:235 Carlton Lane, Joliet, IL 73858:8/12/20:56700
Betty Boop:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23:14500
William Kopf:846-836-2837:6937 Ware Road, Milton, PA 93756:9/21/46:43500
Norma Corder:397-857-2735:74 Pine Street, Dearborn, MI 23874:3/28/45:245700
James Ikeda:834-938-8376:23445 Aster Ave., Allentown, NJ 83745:12/1/38:45000
Lori Gortz:327-832-5728:3465 Mirlo Street, Peabody, MA 34756:10/2/65:35200
Barbara Kerz:385-573-8326:832 Ponce Drive, Gary, IN 83756:12/15/46:268500

1. Print all lines containing the pattern Street.

2. Print lines where the first name matches a B or b.

3. Print last names that match Ker.

4. Print phone numbers in the 408 area code.

5. Print Lori Gortz’s name and address.

6. Print Ephram’s name in capital letters.

7. Print lines that do not contain a 4.

8. Change William’s name to Siegfried.

9. Print Tommy Savage’s birthday.

10. Print the names of those making over $40,000.

11. Print the names and birthdays of those people born in June.

12. Print the ZIP Codes for Massachusetts.
