Chapter 9. Getting Control—Regular Expression Metacharacters

By the end of this chapter, you will be able to unravel and use the following regular expressions:

die unless (/^.+@[^.].*\.[a-z]{2,}$/);

$money =~ s/(?<=\d)(?=(\d\d\d)+(?!\d))/,/g

9.1 The RegExLib.com Library

Before getting deep into the weeds, let’s take a look at the regexlib.com Web site. This Web site allows you to search for a pattern and will show you a list of regular expression solutions and a rating on how well each one performs its pattern-matching task. Although the Web site may not be 100 percent Perlish in the way it handles regexes, it is certainly a good research tool when you’re trying to get some clues on how to write your own. The following is the opening statement found at the home page of RegExLib.com (also shown in Figure 9.1).

Figure 9.1 The RegExLib.com home page.

Welcome to RegExLib.com, the Internet’s first Regular Expression Library. Currently we have indexed 3800 expressions from 2172 contributors around the world. We hope you’ll find this site useful and come back whenever you need help writing an expression, you’re looking for an expression for a particular task, or are ready to contribute new expressions you’ve just figured out. Thanks!

If you look closely at Figure 9.1, you will see a magnifying glass with a search box next to it. In this box, the word email has been typed. If the search button is clicked, another page will appear with a variety of regular expressions that have been designed by different programmers to match for a valid email address (see Figure 9.2). The purpose of each regex is defined and given a rating (the number of green boxes) on its quality, much like grading a hotel. The more green boxes, the better the regex—five being the best, as in a five-star hotel.

Figure 9.2 Search results.

By the time we finish this chapter, you should be able to read any of the regex examples found here. Once you understand all the metacharacters and how they are used, you can write your own regular expressions or use the ones provided here. Knowing what the regular expression is matching on and being able to test it right at the Web site is a great time-saving tool. In Figure 9.2, you can see some examples of how to validate an email address. Note the ratings, the test box, and the description.

9.2 Regular Expression Metacharacters

So what are these metacharacters? Regular expression metacharacters are characters that do not represent themselves. They are endowed with special powers to allow you to control the search pattern in some way (for example, find the pattern only at the beginning of the line, or at the end of the line, or only if it starts with an upper- or lowercase letter). Metacharacters lose their special meaning if preceded with a backslash (\). For example, the dot metacharacter represents any single character, but when preceded with a backslash, is just a dot or period.

If you see a backslash preceding a metacharacter, the backslash turns off the meaning of the metacharacter, but if you see a backslash preceding an alphanumeric character in a regular expression, then the backslash means something else; for example, \d means one decimal digit. Perl provides a simpler form of some of the metacharacters, called metasymbols, to represent characters. For example, [0-9] represents digits in the range between 0 and 9, and \d represents the same thing. [0-9] uses the bracket metacharacter; \d is a metasymbol. Table 9.1 describes the metacharacters and what they do.

Table 9.1 Metacharacters

9.2.1 Metacharacters for Single Characters

If you are searching for a particular character within a regular expression, you can use the dot metacharacter, which represents any single character, or a character class, which matches one character from a set of characters. In addition to the dot and character class, Perl has added some backslashed symbols (called metasymbols) to represent single characters. See Table 9.2.¹

Table 9.2 Metacharacters for Single Characters

1. The metasymbols match on more than just the alphanumeric characters; they are “Unicode” aware.

The Dot Metacharacter

The dot (.) metacharacter matches any single character with the exception of the newline character. For example, the regular expression /a.b/ is matched if the string contains an a, followed by any one single character (except the \n), followed by b, whereas the expression /.../ matches any string containing at least three characters.

The s Modifier—The Dot Metacharacter and the Newline

Normally, the dot metacharacter does not match the newline character, \n, because it matches only the characters within a string up until the newline is reached. The s modifier treats the line with embedded newlines as a single line, rather than a group of multiple lines, and allows the dot metacharacter to treat the newline character the same as any other character it might match. The s modifier can be used with both the m (match) and the s (substitution) operators.
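The following is a minimal sketch of the dot with and without the s modifier. The test string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "a\nb";    # an a, a newline, and a b

# Without the s modifier, the dot refuses to match the newline.
my $plain = ($text =~ /a.b/) ? "match" : "no match";

# With the s modifier, the dot treats the newline like any other character.
my $single = ($text =~ /a.b/s) ? "match" : "no match";

# The dot always needs one character: "ab" has nothing between a and b.
my $empty = ("ab" =~ /a.b/) ? "match" : "no match";

print "without s: $plain\n";     # without s: no match
print "with s:    $single\n";    # with s:    match
print "ab:        $empty\n";     # ab:        no match
```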

The Character Class

A character class represents one character from a set of characters. For example, [abc] matches an a, b, or c, and [a-z] matches one character from a set of characters in the range from a to z, and [0-9] matches one character in the range of digits between 0 and 9. If the character class contains a leading caret (^), then the class represents any one character not in the set; for example, [^a-zA-Z] matches a single character not in the range from a to z or A to Z, and [^0-9] matches a single character not in the range between 0 and 9.² To represent a number between 10 and 13, use 1[0-3], not [10-13].

2. Don’t confuse the caret inside square brackets with the caret used as a beginning of line anchor. See Table 9.7.

Perl provides additional symbols, metasymbols, to represent a character class. The symbols \d and \D represent a single digit and a single nondigit, respectively; for ASCII text they are the same as [0-9] and [^0-9]. Similarly, \w and \W represent a single word character and a single nonword character, respectively; for ASCII text they are the same as [A-Za-z_0-9] and [^A-Za-z_0-9].
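As a rough illustration, the sketch below runs a few character classes and metasymbols against some sample strings. The strings, the labels, and the %matched hash are invented for this example.

```perl
use strict;
use warnings;

my %matched;    # sample string => labels of the patterns that matched it

for my $s ("10", "13", "14", "x7", "a-b") {
    my @hits;
    push @hits, 'ten-to-thirteen' if $s =~ /^1[0-3]$/;    # 1[0-3], not [10-13]
    push @hits, 'has-digit'       if $s =~ /\d/;           # like [0-9]
    push @hits, 'has-nondigit'    if $s =~ /\D/;           # like [^0-9]
    push @hits, 'all-word-chars'  if $s =~ /^\w+$/;        # letters, digits, _
    $matched{$s} = "@hits";
    print "$s: $matched{$s}\n";
}
```

Here 14 falls outside 1[0-3], and a-b fails the \w test because the hyphen is not a word character.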

The POSIX Bracket Expressions

Perl 5.6 introduced POSIX bracket expressions, a special kind of character class. POSIX (the Portable Operating System Interface³) is an industry standard used to ensure that programs are portable across operating systems. In order to be portable, POSIX recognizes that different countries or locales may vary in the way characters are encoded, the symbols used to represent currency, and how times and dates are represented. To handle these different types of characters, the POSIX bracketed character classes are used (see Table 9.3). Separately, the POSIX module permits you to access all (or nearly all) the standard POSIX 1003.1 identifiers.

Table 9.3 The Bracketed Character Class

3. POSIX is a registered trademark of the IEEE. See http://www.opengroup.org/austin/papers/backgrounder.html.

The class [:alnum:] is another way of saying A-Za-z0-9. To use this class, it must be enclosed in another set of brackets for it to be recognized as a regular expression. For example, A-Za-z0-9, by itself, is not a regular expression character class, but [A-Za-z0-9] is. Likewise, [:alnum:] should be written [[:alnum:]]. The difference between using the first form, [A-Za-z0-9], and the bracketed form, [[:alnum:]], is that the first form is dependent on ASCII character encoding, whereas the second form allows characters from other languages to be represented in the class. (For more on POSIX expressions, see www.regular-expressions.info/posixbrackets.html.)

To negate one of the characters in the POSIX character class, the syntax is as follows:

[^[:space:]] - all nonwhitespace characters
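A short sketch of the bracket expressions in use follows. The sample line is borrowed from the exercise file at the end of this chapter, and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;

my $line = "Tommy Savage:408-724-0140";

# [[:alpha:]]+ is a run of letters; [:alpha:] needs the outer brackets.
my ($first) = $line =~ /^([[:alpha:]]+)/;

# [^[:space:]]+ is a run of nonwhitespace characters, here the last one.
my ($token) = $line =~ /([^[:space:]]+)$/;

# A POSIX class can share its brackets with other members, such as a hyphen.
my ($phone) = $line =~ /([-[:digit:]]+)$/;

print "$first\n";    # Tommy
print "$token\n";    # Savage:408-724-0140
print "$phone\n";    # 408-724-0140
```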

9.2.2 Whitespace Metacharacters

A whitespace character represents a space, tab, return, newline, or form feed. The whitespace character can be represented literally, by pressing a Tab key or the spacebar or the Enter key. See Table 9.4.

Table 9.4 Whitespace Metacharacters
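To make the whitespace metasymbols concrete, here is a small sketch that tidies a string containing spaces, a tab, a return, and newlines. The input string and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $raw = "  Tommy\tSavage \n 408-724-0140\r\n";

# \s+ matches any run of whitespace; squeeze each run to one space.
(my $clean = $raw) =~ s/\s+/ /g;

# Trim whatever whitespace is left at either end.
$clean =~ s/^\s+|\s+$//g;

# \S+ matches a run of nonwhitespace characters.
my @tokens = $raw =~ /(\S+)/g;

print "[$clean]\n";                 # [Tommy Savage 408-724-0140]
print scalar(@tokens), " tokens\n"; # 3 tokens
```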

9.2.3 Metacharacters to Repeat Pattern Matches

In the previous examples, the metacharacter matched on a single character. What if you want to match on more than one character? For example, let’s say you are looking for all lines containing names, and the first letter must be in uppercase—which can be represented as [A-Z]—but the following letters are lowercase, and the number of letters varies in each name. [a-z] matches on a single lowercase letter. How can you match on one or more lowercase letters? Or zero or more lowercase letters? To do this, you can use what are called quantifiers. To match on one or more lowercase letters, the regular expression can be written /[a-z]+/ where the + sign means “one or more of the previous characters,” which in this case is one or more lowercase letters. Perl provides a number of quantifiers, as shown in Table 9.5.

Table 9.5 The Greedy Metacharacters
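The name-matching question above can be sketched as follows, using three common quantifiers: + (one or more), * (zero or more), and {2,5} (at least two and at most five). The sample names and the %result hash are assumptions made for the example.

```perl
use strict;
use warnings;

my %result;    # name => quantifiers whose pattern matched the whole name

for my $name ("Tommy", "T", "tommy", "JonDeLoach") {
    my @ok;
    push @ok, '+'     if $name =~ /^[A-Z][a-z]+$/;        # one or more
    push @ok, '*'     if $name =~ /^[A-Z][a-z]*$/;        # zero or more
    push @ok, '{2,5}' if $name =~ /^[A-Z][a-z]{2,5}$/;    # two to five
    $result{$name} = @ok ? "@ok" : "none";
    print "$name: $result{$name}\n";
}
```

A lone T satisfies only the * form, because zero lowercase letters follow it.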

The Greed Factor

Normally, quantifiers are greedy; in other words, they match the largest possible set of characters. Starting at the left-hand side of the string and searching to the right, a greedy quantifier looks for the last possible character that would satisfy the condition. For example, given the following string:

$_="ab123456783445554437AB"

and the regular expression

s/ab[0-9]*/X/;

the search side would match

ab123456783445554437

All of this will be replaced with an X. After the substitution, $_ would be

XAB

The asterisk (*) is a greedy metacharacter. It matches for zero or more of the preceding characters. In other words, it attaches itself to the character preceding it and looks only for zero or more occurrences of that character. In the preceding example, the asterisk attaches itself to the character class [0-9]. The matching starts on the left, searching for ab followed by zero or more numbers in the range between 0 and 9. The matching continues until the last number is found; in this example, the number 7. The pattern ab and all of the numbers in the range between 0 and 9 are replaced with a single X. The trailing characters, AB, remain.

Greediness can be turned off so that instead of matching on the greatest number of characters, the match is made on the least number of characters found. This is done by appending a question mark after the greedy metacharacter (see Example 9.23).

Metacharacters That Turn off Greediness

By placing a question mark after a greedy quantifier, the greed is turned off, and the quantifier matches as few characters as it can, so the search ends at the first point where the whole pattern is satisfied rather than the last one. Table 9.6 describes the metacharacters that turn off greediness.

Table 9.6 Turning off Greediness
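The sketch below reuses the string from the greedy example and compares each greedy quantifier with its question-mark form. The variable names are invented, and each substitution works on its own copy of the string.

```perl
use strict;
use warnings;

my $str = "ab123456783445554437AB";

# Greedy: [0-9]* swallows every digit it can.
(my $greedy = $str) =~ s/ab[0-9]*/X/;

# Not greedy: [0-9]*? is satisfied with zero digits.
(my $lazy = $str) =~ s/ab[0-9]*?/X/;

# Followed by a 7, the greedy form runs to the last 7, the other to the first.
(my $to_last_7  = $str) =~ s/ab[0-9]*7/X/;
(my $to_first_7 = $str) =~ s/ab[0-9]*?7/X/;

print "$greedy\n";        # XAB
print "$lazy\n";          # X123456783445554437AB
print "$to_last_7\n";     # XAB
print "$to_first_7\n";    # X83445554437AB
```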

Anchoring Metacharacters

Often, it is necessary to anchor a metacharacter so that it matches only if the pattern is found at the beginning or end of a line, word, or string. These metacharacters are based on a position just to the left or to the right of the character that is being matched. Anchors (see Table 9.7) are technically called zero-width assertions because they correspond to positions, not actual characters in a string. For example, /^abc/ means: find abc at the beginning of the line, where the ^ represents a position, not an actual character.

Table 9.7 Anchors (Assertions)

The m Modifier

The m modifier is used to control the behavior of the $ and ^ anchor metacharacters. A string containing newlines will be treated as multiple lines. If the regular expression is anchored with the ^ metacharacter, and that pattern is found at the beginning of any one of the multiple lines, the match is successful. Likewise, if the regular expression is anchored by the $ metacharacter, and the pattern is found at the end of any one of the multiple lines (that is, just before a \n), it too will return a successful match. The m modifier has no effect with \A and \z.
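As a sketch of how the m modifier changes the anchors, the code below collects matches from a three-line string. The string and the array names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\nthird line\n";

# Without m, ^ matches only at the very start of the string.
my @no_m = $text =~ /^(\w+)/g;

# With m, ^ also matches just after each embedded newline.
my @with_m = $text =~ /^(\w+)/mg;

# \A ignores the m modifier and still means the start of the string.
my @string_start = $text =~ /\A(\w+)/mg;

# With m, $ matches just before each newline.
my @line_ends = $text =~ /(\w+) line$/mg;

# \z is the true end of the string, which here is the final newline.
my $at_very_end = ($text =~ /line\z/m) ? 1 : 0;

print "@no_m\n";            # first
print "@with_m\n";          # first second third
print "@string_start\n";    # first
print "@line_ends\n";       # first second third
print "$at_very_end\n";     # 0
```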

Alternation

Alternation allows the regular expression to contain alternative patterns to be matched. For example, the regular expression /John|Karen|Steve/ will match a line containing John or Karen or Steve. If Karen, John, and Steve are all on different lines, all of those lines are matched. Each of the alternative expressions is separated by a vertical bar (pipe symbol) and the expressions can consist of any number of characters, unlike the character class that matches for only one character; for example, /a|b|c/ is the same as [abc], whereas /ab|de/ cannot be represented as [abde]. The pattern /ab|de/ is either ab or de, whereas the class [abde] represents only one character in the set, a, b, d, or e.
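A minimal sketch of alternation, and of how it differs from a character class, might look like this. The sample lines and variable names are invented for the example.

```perl
use strict;
use warnings;

my @lines = ("John went home", "Karen stayed", "Stephen left", "Steve arrived");

# Keep every line that contains any one of the three names.
my @hits = grep { /John|Karen|Steve/ } @lines;

# /ab|de/ needs a whole two-character alternative; [abde] needs one character.
my $alt_on_a   = ("a"  =~ /ab|de/)  ? 1 : 0;
my $class_on_a = ("a"  =~ /[abde]/) ? 1 : 0;
my $alt_on_de  = ("de" =~ /ab|de/)  ? 1 : 0;

print scalar(@hits), " lines matched\n";            # 3 lines matched
print "$alt_on_a $class_on_a $alt_on_de\n";         # 0 1 1
```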

Grouping or Clustering

If the regular expression pattern is enclosed in parentheses, a subpattern is created. Then, for example, instead of the greedy metacharacters matching on zero, one, or more of the previous single characters, they can match on the previous subpattern. Alternation can also be controlled if the patterns are enclosed in parentheses. This process of grouping characters together is also called clustering by the Perl wizards.

Remembering or Capturing

If the regular expression pattern is enclosed in parentheses, a subpattern is created. The subpattern is saved in special numbered scalar variables, starting with $1, then $2, and so on. These variables can be used later in the program and will persist until another successful pattern match occurs, at which time they will be reset. Even if the intention was only to use the parentheses for grouping, as shown in the previous examples, the subpatterns are saved as a side effect.⁴

4. It is possible to prevent a subpattern from being saved.

Turning off Greed

Greed can be turned off using the question mark (?) character.

Turning off Capturing

When the only purpose is to use the parentheses for grouping, and you are not interested in saving the subpatterns in $1, $2, or $3, the special ?: metacharacter can be used to suppress the capturing of the subpattern.
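Grouping, capturing, and turning off capturing can be sketched together as follows. The sample entry comes from the exercise file at the end of this chapter, and the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $entry = "Tommy Savage:408-724-0140";

# Capturing: each parenthesized subpattern is saved in $1, $2, $3.
my ($first, $last, $area);
if ($entry =~ /^(\w+) (\w+):(\d{3})-/) {
    ($first, $last, $area) = ($1, $2, $3);
}

# Grouping: the + applies to the whole subpattern ab, not just to b.
my $repeated = ("ababab!" =~ /^(ab)+!$/) ? 1 : 0;

# (?:...) groups without saving, so the digits land in $1.
my $digits;
if ("abab42" =~ /^(?:ab)+(\d+)$/) {
    $digits = $1;
}

print "$last, $first ($area)\n";    # Savage, Tommy (408)
print "$repeated $digits\n";        # 1 42
```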

Metacharacters That Look Ahead and Behind

Suppose you want to find and replace words in a document that are followed by a comma. In your search string, you have the word you are looking for followed by the comma as part of the search criteria, but you want to exclude the comma when replacing the word. Looking ahead for a pattern that will be matched and then excluded, in this case the comma, is called a positive look ahead. A negative look ahead would look ahead for a character that is not there.

A positive look ahead is an assertion like the ^ and $ anchors in that it represents a position in the search. A regular expression contains the positive look ahead as /regex(?=pattern)/. So for example, if you say s/John(?= Doe)/Jane/, the regex engine will search for John and look ahead to see if a space and Doe follow. If they do, the positive look ahead is true, but the text it matched is not made part of the match (and is not captured in $1). Doe will not be included in what is replaced; only John will be replaced with Jane. (Anything written outside the look ahead, such as a space between John and the opening parenthesis, is part of the match and would be replaced along with John.)

A negative look ahead looks ahead to see if the pattern (?!pattern) is not there, and if it is not, succeeds, discarding the pattern after the ?!.

With a positive look behind, Perl looks backward in the string for a pattern (?<=pattern) and if that pattern is found, will then continue pattern matching on the regular expression, discarding the pattern in parentheses. A negative look behind looks behind in the string to see if a pattern (?<!pattern) is not there, and if it is not, succeeds in the matching. See Table 9.8.

Table 9.8 Look Around Assertions
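The four look around assertions can be sketched as below, ending with the comma-insertion expression shown at the start of this chapter. The sample strings and variable names are invented; note that the space sits inside the look ahead so that it is not replaced.

```perl
use strict;
use warnings;

my $names = "John Doe and John Smith";

# Positive look ahead: replace John only when " Doe" follows.
(my $ahead = $names) =~ s/John(?= Doe)/Jane/g;

# Negative look ahead: replace John only when " Doe" does not follow.
(my $not_ahead = $names) =~ s/John(?! Doe)/Jane/g;

my $bill = 'cost $100, qty 100';

# Positive look behind: change digits that come right after a dollar sign.
(my $behind = $bill) =~ s/(?<=\$)\d+/200/;

# Negative look behind: change a number that does not follow a dollar sign.
(my $not_behind = $bill) =~ s/(?<!\$)\b\d+/5/;

# Insert a comma at each position that has a digit behind it and
# complete groups of three digits, and nothing but those, ahead of it.
(my $money = "1234567") =~ s/(?<=\d)(?=(\d\d\d)+(?!\d))/,/g;

print "$ahead\n";         # Jane Doe and John Smith
print "$not_ahead\n";     # John Doe and Jane Smith
print "$behind\n";        # cost $200, qty 100
print "$not_behind\n";    # cost $100, qty 5
print "$money\n";         # 1,234,567
```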

9.2.4 The tr or y Operators

The tr operator translates characters on a one-to-one basis. To see what this means, let’s compare translation to substitution. You can see in the following example that the syntax of the tr operator and that of the substitution operator look very much alike, but the two are really quite different in what they do. Let’s take a look at substitution first:

$str = "Elizabeth likes little baby lizards.\n";
$str =~ s/Elizabeth/Christopher/;
print "$str\n";

and the result is:

Christopher likes little baby lizards.

Now let’s look at the tr function.

$str = "Elizabeth likes little baby lizards.\n";
$str =~ tr/Elizabeth/Christopher/;
print "$str\n";

and the result is:

Christoph hrkos hrppho tsty hrisrds.

What is different? The s operator searches for a pattern and replaces it with a string; that is, Elizabeth is replaced with Christopher. The tr operator⁵ translates characters, in a one-to-one correspondence, from each character in the search string to its corresponding character in the replacement string and returns the number of characters it replaced. In the preceding example, every E in $str is translated to a corresponding C, every l is translated to an h, every i is translated to an r, and so on.

5. The Perl tr function is derived from the UNIX tr command.

The tr operator does not interpret regular expression metacharacters but allows a dash to represent a range of characters. The letter y can be used in place of tr. This strangeness comes from UNIX, where the sed utility has a y command to translate characters, similar to the UNIX tr. If you look at the UNIX tr man page, you can see that it is very similar to the Perl tr function, illustrating the role UNIX has played in the development of Perl.
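The following sketch shows ranges, the y synonym, and the return value of tr. The variable names are made up, and each translation works on its own copy of the string.

```perl
use strict;
use warnings;

my $str = "Elizabeth likes little baby lizards.";

# A dash gives a range: translate each lowercase letter to uppercase.
(my $upper = $str) =~ tr/a-z/A-Z/;

# y is a synonym for tr.
(my $upper_y = $str) =~ y/a-z/A-Z/;

# tr returns a count. With an empty replacement and no options,
# the string is left unchanged, so this only counts lowercase letters.
my $lowercase = ($str =~ tr/a-z//);

# The one-to-one translation from the example above.
(my $mapped = $str) =~ tr/Elizabeth/Christopher/;

print "$upper\n";        # ELIZABETH LIKES LITTLE BABY LIZARDS.
print "$lowercase\n";    # 30
print "$mapped\n";       # Christoph hrkos hrppho tsty hrisrds.
```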

The d option deletes the characters in the search string that have no corresponding character in the replacement string.

The c option complements the search string.

The s option is called the squeeze option. Multiple occurrences of characters found in the search string are replaced by a single occurrence of that character (for example, you may want to replace multiple tabs with single tabs). See Table 9.9 for a list of modifiers.

Table 9.9 tr Modifiers

The d Delete Option

The d (delete) option removes every character listed in the search string that has no corresponding character in the replacement string.

The c Complement Option

The c (complement) option complements the search string; that is, it translates each character not listed in this string to its corresponding character in the replacement string.

The s Squeeze Option

The s (squeeze) option translates all characters that are repeated to a single character and can be used to get rid of excess characters, such as excess whitespace or delimiters, squeezing these characters down to just one.
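The three options can be sketched as follows. The sample strings and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

# d: delete the hyphens, which have no replacement character.
(my $deleted = "408-724-0140") =~ tr/-//d;

# c: complement the search list, so every nondigit becomes a star.
(my $complemented = "408-724-0140") =~ tr/0-9/*/c;

# s: squeeze each run of spaces down to a single space.
(my $squeezed = "a   b      c") =~ tr/ //s;

# c and d together: delete everything that is not a digit.
(my $digits_only = "Tel: 408-724-0140") =~ tr/0-9//cd;

print "$deleted\n";         # 4087240140
print "$complemented\n";    # 408*724*0140
print "$squeezed\n";        # a b c
print "$digits_only\n";     # 4087240140
```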

9.3 Unicode

For every character, Unicode specifies a unique identification number called a code point that remains consistent across applications, languages, and platforms.

With the advent of the Internet, it became obvious that the ASCII coding for characters was insufficient if the whole world were to be included in transferring data from one Web site to another without corrupting the data. The ASCII character set defines only 128 characters, and even its one-byte (8-bit) extensions hold only 256, which could hardly accommodate languages like Chinese and Japanese, where a given symbol is drawn from a set of thousands of characters.

The Unicode standard is an effort to solve the problem by creating a new character set, along with encodings called UTF8 and UTF16, in which characters are not limited to one byte. UTF8, for example, uses from one to four bytes per character, and UTF16 uses one or two 16-bit units; in both, each character has a unique number. To remove ambiguity, any given code point always represents the same character, thereby allowing for consistent sorting, searching, displaying, and editing of text. According to the Unicode Consortium,⁶ Unicode has the capacity to encode over one million characters, which is sufficient to encompass all the world’s written languages. Further, all symbols are treated equally, so that all characters can be accessed without the need for escape sequences or control codes.

6. The Unicode Consortium is a nonprofit organization founded to develop, extend, and promote use of the Unicode standard. For more information on Unicode and the Unicode Consortium, go to www.unicode.org/unicode/standard/whatisunicode.html.

9.3.1 Perl and Unicode

“The days of just flinging strings around are over. It’s well established that modern programs need to be capable of communicating funny accented letters, and things like euro symbols. This means that programmers need new habits. It’s easy to program Unicode capable software, but it does require discipline to do it right.”

— Perlunitut

The largest change in Perl 5.6 was to provide UTF8 Unicode support. By default, Perl represents strings internally in Unicode, and all the relevant built-in functions (length, reverse, sort, tr) now work on a character-by-character basis instead of on a byte-by-byte basis. Two Perl pragmas control Unicode settings. The utf8 pragma turns on UTF8 handling (in current versions of Perl, it specifically tells Perl that the program text itself is written in UTF8), while the bytes pragma refers to the old byte meanings, reading one byte at a time. (For a complete discussion of Perl and Unicode, see perldoc.perl.org/perlunicode.html.)

To find out what character encoding your version of Perl uses, type at the prompt:

$ perl -MEncode -le "print for encodings(':all')"
ascii
ascii-ctrl
iso-8859-1
null
utf-8-strict
utf8
(This output is for Perl 5.16.)

When utf8 is turned on, you can specify string literals in Unicode using the \x{Number} notation for characters (called code points) 0xFF and above (see www.unicode-table.com) where Number is a hexadecimal character code such as \x{395}. See Figure 9.3.

Figure 9.3 The unicode-table.com Web site.

You can also use the \N{U+hexnumber} notation where hexnumber in the braces is the hexadecimal number for the Unicode character; for example, a smiley face is \N{U+263A}, or use the official name for the Unicode character, \N{WHITE SMILING FACE}. For a list of Unicode character names, see www.unicode.org/charts/charindex.html.
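A minimal sketch of the three notations follows. It assumes the core Encode module to count bytes and loads the charnames pragma so that named characters also work on versions of Perl older than 5.16; the variable names are invented for the example.

```perl
use strict;
use warnings;
use charnames ':full';     # needed for \N{NAME} before Perl 5.16
use Encode qw(encode);

my $by_hex  = "\x{263A}";
my $by_u    = "\N{U+263A}";
my $by_name = "\N{WHITE SMILING FACE}";
my $epsilon = "\x{395}";   # Greek capital letter epsilon

# All three notations produce the same one-character string.
my $same  = ($by_hex eq $by_u && $by_u eq $by_name) ? 1 : 0;
my $chars = length($by_hex);

# Encoded as UTF-8, the characters take different numbers of bytes.
my $smiley_bytes  = length(encode('UTF-8', $by_hex));
my $epsilon_bytes = length(encode('UTF-8', $epsilon));

# Tell Perl to encode output as UTF-8, avoiding "Wide character" warnings.
binmode STDOUT, ':encoding(UTF-8)';
print "$by_name: $chars character, $smiley_bytes bytes\n";
print "$epsilon: 1 character, $epsilon_bytes bytes\n";
```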

Unicode also provides support for regular expressions and matching characters based on Unicode properties, some of which are defined by the Unicode standard and some by Perl. The Perl properties are composites of the standard properties; in other words, you can now match any uppercase character in any language with \p{IsUpper}.

Table 9.10 is a list of Perl’s composite character classes. If the p in \p is capitalized, the meaning is a negation; so, for example, \p{IsASCII} represents an ASCII character, whereas \P{IsASCII} represents a non-ASCII character.

Table 9.10 utf8 Composite Character Classes
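As a rough sketch of property matching, the code below tests a string that mixes ASCII and Greek letters. The string is written with \x{} escapes so the program text stays plain ASCII; the variable names are assumptions made for the example.

```perl
use strict;
use warnings;

my $epsilon = "\x{395}";    # Greek capital letter epsilon
my $lambda  = "\x{3BB}";    # Greek small letter lambda
my $mixed   = "A" . $epsilon . "b" . $lambda;

# \p{IsUpper} matches an uppercase character in any script.
my $epsilon_is_upper = ($epsilon =~ /^\p{IsUpper}$/) ? 1 : 0;
my $lambda_is_upper  = ($lambda  =~ /^\p{IsUpper}$/) ? 1 : 0;

# Count the matches by assigning the match list to an empty list.
my $upper_count     = () = $mixed =~ /\p{IsUpper}/g;
my $non_ascii_count = () = $mixed =~ /\P{IsASCII}/g;    # capital P negates

print "$epsilon_is_upper $lambda_is_upper\n";    # 1 0
print "$upper_count $non_ascii_count\n";         # 2 2
```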

9.4 What You Should Know

1. What are metacharacters used for?

2. What is a character class?

3. What is meant by a “greedy” metacharacter?

4. What is an anchoring metacharacter?

5. How do you search for a literal period?

6. What is capturing? Can you turn it off?

7. What is grouping?

8. How does a character class differ from alternation?

9. How do you search for one or more digits?

10. How do you search for zero or one digit?

11. What is a metasymbol?

12. What is the purpose of the “squeeze” option when used with tr?

13. What is utf8?

9.5 What’s Next?

In the next chapter, we discuss how Perl deals with files, how to open them, read from them, write to them, append to them, and close them. You will learn how die works. You will learn how to seek to a position within a file, how to rewind back to the top, how to mark a spot for the next read operation. You will learn how to perform file tests to see if a file is readable, writeable, executable, and so forth. We will also discuss pipes, how Perl sends output to a pipe, and how Perl reads from a pipe. You will learn how to pass arguments to a Perl script at the command line and all the variations of ARGV.

Exercise 9: And the Search Goes On . . .

(Sample file found on CD)

Tommy Savage:408-724-0140:1222 Oxbow Court, Sunnyvale,CA 94087:5/19/66:34200

Lesle Kerstin:408-456-1234:4 Harvard Square, Boston, MA 02133:4/22/62:52600

JonDeLoach:408-253-3122:123 Park St., San Jose, CA 94086:7/25/53:85100

Ephram Hardy:293-259-5395:235 Carlton Lane, Joliet, IL 73858:8/12/20:56700

Betty Boop:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23:14500

Wilhelm Kopf:846-836-2837:6937 Ware Road, Milton, PA 93756:9/21/46:43500

Norma Corder:397-857-2735:74 Pine Street, Dearborn, MI 23874:3/28/45:245700

James Ikeda:834-938-8376:23445 Aster Ave., Allentown, NJ 83745:12/1/38:45000

Lori Gortz:327-832-5728:3465 Mirlo Street, Peabody, MA 34756:10/2/65:35200

Barbara Kerz:385-573-8326:832 Ponce Drive, Gary, IN 83756:12/15/46:268500

1. Print the city and state where Norma lives.

2. Give everyone a $250.00 raise.

3. Calculate Lori’s age.

4. Print lines 2 through 6. (The $. variable holds the current line number.)

5. Print names and phone numbers of those in the 408 area code.

6. Print names and salaries in lines 3, 4, and 5.

7. Print a row of stars after line 3.

8. Change CA to California.

9. Print the file with a row of stars after the last line.

10. Print the names of the people born in March.

11. Print all lines that don’t contain Karen.

12. Print lines that end in exactly five digits; no more, no less.

13. Print the file with the first and last names reversed with only the first letter of the first name and the full last name; for example, Savage,

14. Print all cities in California, and the first names of those people who live there.

15. Without using the split function, print all the lines up to the first colon (just the names).

16. Without using the split function, print the street address; for example, 123 Park St.

17. Create and display a new format for all the phone numbers to look like this:

(408) 465-1234

18. Print a smiley face, a heart, and a black chess knight after line 6.
