14. Perl-Compatible Regular Expressions

In This Chapter

Creating a Test Script

Defining Simple Patterns

Using Quantifiers

Using Character Classes

Finding All Matches

Using Modifiers

Matching and Replacing Patterns

Review and Pursue

Regular expressions are an amazingly powerful (but tedious) tool available in most of today’s programming languages and even in many applications. Think of regular expressions as an elaborate system of matching patterns. You first write the pattern and then use one of PHP’s built-in functions to apply the pattern to a value (regular expressions are applied to strings, even if that means a string with a numeric value). Whereas a string function could see if the name John is in some text, a regular expression could just as easily find John, Jon, and Jonathon.

Because the regular expression syntax is so complex, while the functions that use them are simple, the focus in this chapter will be on mastering the syntax in little bites. The PHP code will be very simple; later chapters will better incorporate regular expressions into real-world scripts.

Creating a Test Script

As already stated, regular expressions are a matter of applying patterns to values. The application of the pattern to a value is accomplished using one of a handful of functions, the most important being preg_match( ). This function returns a 0 or 1, indicating whether or not the pattern matched the string. Its basic syntax is

preg_match(pattern, subject);

The preg_match( ) function will stop once it finds a single match. If you need to find all the matches, use preg_match_all( ). That function will be discussed toward the end of the chapter.

When providing the pattern to preg_match( ), it needs to be placed within quotation marks, as it’ll be a string. Because many escaped characters within double quotation marks have special meaning (like \n), I advocate using single quotation marks to define your patterns.

Secondarily, within the quotation marks, the pattern needs to be encased within delimiters. The delimiter can be any character that’s not alphanumeric or the backslash, and the same character must be used to mark the beginning and end of the pattern. Commonly, you’ll see forward slashes used. To see if the word cat contains the letter a, you would code (spoiler alert: it does):

if (preg_match('/a/', 'cat')) {

If you need to match a forward slash in the pattern, use a different delimiter, like the pipe (|) or an exclamation mark (!).
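As a minimal sketch of the delimiter rule, the following compares an escaped forward slash inside slash delimiters with the same pattern wrapped in exclamation marks. The subject string is a made-up value used only for illustration.

```php
<?php
// Hypothetical subject: a path that contains forward slashes.
$subject = 'docs/php/regex';

// With slash delimiters, the slash inside the pattern must be escaped:
$escaped = preg_match('/php\/regex/', $subject);

// With a different delimiter (here, the exclamation mark), no escaping is needed:
$alternate = preg_match('!php/regex!', $subject);

var_dump($escaped, $alternate);
```

Both calls apply the same pattern; the second is simply easier to read.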

The bulk of this chapter covers all the rules for defining patterns. In order to best learn by example, let’s start by creating a simple PHP script that takes a pattern and a string and returns the regular expression result.

Figure: The HTML form, which will be used for practicing regular expressions.

Figure: The script will print what values were used in the regular expression and what the result was. The form will also be made sticky to remember previously submitted values.

Script 14.1. The complex regular expression syntax will be best taught and demonstrated using this PHP script.


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
   <title>Testing PCRE</title>
</head>
<body>
<?php // Script 14.1 - pcre.php
// This script takes a submitted string and checks it against a submitted pattern.

if ($_SERVER['REQUEST_METHOD'] == 'POST') {

   // Trim the strings:
   $pattern = trim($_POST['pattern']);
   $subject = trim($_POST['subject']);

   // Print a caption:
   echo "<p>The result of checking<br /><b>$pattern</b><br />against<br />$subject<br />is ";

   // Test:
   if (preg_match ($pattern, $subject) ) {
      echo 'TRUE!</p>';
   } else {
      echo 'FALSE!</p>';
   }

} // End of submission IF.
// Display the HTML form.
?>
<form action="pcre.php" method="post">
   <p>Regular Expression Pattern: <input type="text" name="pattern" value="<?php if (isset($pattern)) echo htmlentities($pattern); ?>" size="40" /> (include the delimiters)</p>
   <p>Test Subject: <input type="text" name="subject" value="<?php if (isset($subject)) echo htmlentities($subject); ?>" size="40" /></p>
   <input type="submit" name="submit" value="Test!" />
</form>
</body>
</html>


To match a pattern

1. Begin a new PHP document in your text editor or IDE, to be named pcre.php (Script 14.1).

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>Testing PCRE</title>
</head>
<body>
<?php // Script 14.1 - pcre.php

2. Check for the form submission:

if ($_SERVER['REQUEST_METHOD'] == 'POST') {

3. Treat the incoming values:

$pattern = trim($_POST['pattern']);
$subject = trim($_POST['subject']);

The form will submit two values to this same script. Both should be trimmed, just to make sure the presence of any extraneous spaces doesn’t skew the results. I’ve omitted a check that each input isn’t empty, but you could include that if you wanted.

Note that if Magic Quotes is enabled on your server, you’ll see extra slashes added to the form data. To combat this, you’ll need to apply stripslashes( ) here as well:

$pattern = stripslashes(trim($_POST['pattern']));
$subject = stripslashes(trim($_POST['subject']));

Figure: With Magic Quotes enabled on your server, the script will add slashes to certain characters, most likely making the regular expressions fail.

4. Print a caption:

echo "<p>The result of checking<br /><b>$pattern</b><br />against
<br />$subject<br />is ";

As you can see, the form-handling part of this script will start by printing the values submitted.

5. Run the regular expression:

if (preg_match ($pattern, $subject) ) {
  echo 'TRUE!</p>';
} else {
  echo 'FALSE!</p>';
}

To test the pattern against the string, feed both to the preg_match( ) function. If this function returns 1, that means a match was made, this condition will be TRUE, and the word TRUE will be printed. If no match was made, the condition will be FALSE and that will be stated.

Figure: If the pattern does not match the string, this will be the result. This submission and response also conveys that regular expressions are case-sensitive by default.

6. Complete the submission conditional and the PHP block:

} // End of submission IF.
?>

7. Create the HTML form:

<form action="pcre.php" method="post">
  <p>Regular Expression Pattern: <input type="text" name="pattern" value="<?php if (isset($pattern)) echo htmlentities($pattern); ?>" size="40" /> (include the delimiters)</p>
  <p>Test Subject: <input type="text" name="subject" value="<?php if (isset($subject)) echo htmlentities($subject); ?>" size="40" /></p>
  <input type="submit" name="submit" value="Test!" />
</form>

The form contains two text boxes, both of which are sticky (using the trimmed version of the values). Because the two values might include quotation marks and other characters that would conflict with the form’s “stickiness,” each variable’s value is sent through htmlentities( ), too.

8. Complete the HTML page:

</body>
</html>

9. Save the file as pcre.php, place it in your Web directory, and test it in your Web browser.

Although you don’t know the rules for creating patterns yet, you could use any other literal value. Remember to use delimiters around the pattern or else you’ll see an error message.

Figure: If you fail to wrap the pattern in matching delimiters, you’ll see an error message.
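Because the test script accepts whatever pattern is typed into the form, it can help to know in advance whether a pattern is usable. The following is a minimal sketch, assuming a hypothetical helper named pattern_is_usable( ) that is not part of Script 14.1. It relies on preg_match( ) returning FALSE, rather than 0, when the pattern itself is invalid.

```php
<?php
// Hypothetical helper (not part of Script 14.1): reports whether PCRE
// can compile the pattern at all.
function pattern_is_usable($pattern) {
    // preg_match() returns FALSE (not 0) when the pattern is invalid.
    // The @ operator suppresses the accompanying warning.
    return @preg_match($pattern, '') !== false;
}

var_dump(pattern_is_usable('/cat/')); // Valid: wrapped in matching delimiters.
var_dump(pattern_is_usable('cat'));   // Invalid: no delimiters.
```

Note the strict comparison (!==): a valid pattern that simply fails to match an empty string returns 0, which must not be confused with FALSE.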


Tip

Some text editors, such as BBEdit and emacs, allow you to use regular expressions to match and replace patterns within and throughout several documents.



Tip

The PCRE functions all use the established locale. A locale reflects a computer’s designated country and language, among other settings.



Tip

Previous versions of PHP supported another type of regular expressions, called POSIX. These have since been deprecated, meaning they’ll be dropped from future versions of the language.


Defining Simple Patterns

Using one of PHP’s regular expression functions is really easy; defining patterns to use is hard. There are lots of rules for creating a pattern. You can use these rules separately or in combination, making your pattern either quite simple or very complex. To start, then, you’ll see what characters are used to define a simple pattern. As a formatting rule, I’ll define patterns in bold and will indicate what the pattern matches in italics. The patterns in these explanations won’t be placed within delimiters or quotes (both being needed when used within preg_match( )), just to keep things cleaner.

The first type of character you will use for defining patterns is a literal. A literal is a value that is written exactly as it is interpreted. For example, the pattern a will match the letter a, ab will match ab, and so forth. Therefore, assuming a case-insensitive search is performed, rom will match any of the following strings, since they all contain rom:

• CD-ROM

• Rommel crossed the desert.

• I’m writing a roman à clef.

Along with literals, your patterns will use meta-characters. These are special symbols that have a meaning beyond their literal value (Table 14.1). While a simply means a, the period (.) will match any single character except for a newline (. matches a, b, c, the underscore, a space, etc., just not \n). To match any meta-character, you will need to escape it, much as you escape a quotation mark to print it. Hence \. will match the period itself. So 1.99 matches 1.99 or 1B99 or 1299 (a 1 followed by any character followed by 99) but 1\.99 only matches 1.99.

Table 14.1. Meta-Characters


Two meta-characters specify where certain characters must be found. There is the caret (^), which marks the beginning of a pattern. There is also the dollar sign ($), which marks the conclusion of a pattern. Accordingly, ^a will match any string beginning with an a, while a$ will correspond to any string ending with an a. Therefore, ^a$ will only match a (a string that both begins and ends with a).

These two meta-characters—the caret and the dollar sign—are crucial to validation, as validation normally requires checking the value of an entire string, not just the presence of one string in another. For example, using an email-matching pattern without those two characters will match any string containing an email address. Using an email-matching pattern that begins with a caret and ends with a dollar sign will match a string that contains only a valid email address.

Regular expressions also make use of the pipe (|) as the equivalent of or: a|b will match strings containing either a or b. (Using the pipe within patterns is called alternation or branching). So yes|no accepts either of those two words in their entirety (the alternation is not just between the two letters surrounding it: s and n).

Once you comprehend the basic symbols, then you can begin to use parentheses to group characters into more involved patterns. Grouping works as you might expect: (abc) will match abc, (trout) will match trout. Think of parentheses as being used to establish a new literal of a larger size. Because of precedence rules in PCRE, yes|no and (yes)|(no) are equivalent. But (even|heavy) handed will match either even handed or heavy handed.
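The rules covered so far (literals, the period, escaping, the caret and dollar sign, alternation, and grouping) can be sketched in one short script. The subjects here are arbitrary values chosen for illustration, and each pattern is wrapped in slash delimiters and single quotes, as preg_match( ) requires.

```php
<?php
// Each entry pairs a pattern with a subject to test it against.
$tests = array(
    array('/^a$/', 'a'),                            // Begins and ends with a.
    array('/^a$/', 'banana'),                       // Contains a, but is not just a.
    array('/1.99/', '1B99'),                        // The period matches any character.
    array('/1\.99/', '1B99'),                       // The escaped period is literal.
    array('/yes|no/', 'no thanks'),                 // Alternation between whole words.
    array('/(even|heavy) handed/', 'heavy handed'), // Alternation within a group.
);

$results = array();
foreach ($tests as $test) {
    list($pattern, $subject) = $test;
    $result = preg_match($pattern, $subject);
    $results[] = $result;
    echo $pattern . ' against ' . $subject . ': ' . ($result ? 'TRUE' : 'FALSE') . "\n";
}
```

The second and fourth tests are the only ones that fail, which shows the effect of the anchors and of escaping the period.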

To use simple patterns

1. Load pcre.php in your Web browser, if it is not already.

2. Check if a string contains the letters cat.

Figure: Looking for a cat in a string.

To do so, use the literal cat as the pattern and any number of strings as the subject. Any of the following would be a match: catalog, catastrophe, my cat left, etc. For the time being, use all lowercase letters, as cat will not match Cat.

Figure: PCRE performs a case-sensitive comparison by default.

Remember to use delimiters around the pattern, as well (see the figures).

3. Check if a string starts with cat.

Figure: The caret in a pattern means that the match has to be found at the start of the string.

To have a pattern apply to the start of a string, use the caret as the first character (^cat). The sentence my cat left will not be a match now.

4. Check if a string contains the word color or colour.

Figure: By using the pipe meta-character, the performed search can be more flexible.

The pattern to look for the American or British spelling of this word is col(o|ou)r. The first three letters—col—must be present. This needs to be followed by either an o or ou. Finally, an r is required.


Tip

If you are looking to match an exact string within another string, use the strstr( ) function, which is faster than regular expressions. In fact, as a rule of thumb, you should use regular expressions only if the task at hand cannot be accomplished using any other function or technique.



Tip

You can escape a bunch of characters in a pattern using \Q and \E. Every character within those will be treated literally (so \Q$2.99?\E matches $2.99?).



Tip

To match a single backslash, you have to use \\\\. The reason is that matching a backslash in a regular expression requires you to escape the backslash, resulting in \\. Then to use a backslash in a PHP string, it also has to be escaped, so escaping both backslashes means a total of four.
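The two escaping tips can be seen in a short sketch. The subjects are invented for illustration. In the first pattern, PHP reduces the four backslashes in the single-quoted string to two, and PCRE then reads those two as one escaped backslash.

```php
<?php
// In a single-quoted PHP string, \\ stands for one backslash,
// so this subject is the seven characters C:\temp.
$path = 'C:\\temp';

// Four backslashes in the PHP source become \\ in the pattern,
// which PCRE interprets as a single literal backslash.
$hasBackslash = preg_match('/\\\\/', $path);
$noBackslash = preg_match('/\\\\/', 'no slashes here');

// Everything between \Q and \E is literal, so the dollar sign,
// period, and question mark lose their special meanings.
$literal = preg_match('/\Q$2.99?\E/', 'Only $2.99? Really?');
$notLiteral = preg_match('/\Q$2.99?\E/', 'Only $2x99');

var_dump($hasBackslash, $noBackslash, $literal, $notLiteral);
```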


Using Quantifiers

You’ve just seen and practiced with a couple of the meta-characters, the most important of which are the caret and the dollar sign. Next, there are three meta-characters that allow for multiple occurrences: a* will match zero or more a’s (no a’s, a, aa, aaa, etc.); a+ matches one or more a’s (a, aa, aaa, etc., but there must be at least one); and a? will match up to one a (a or no a’s match). These meta-characters all act as quantifiers in your patterns, as do the curly braces. Table 14.2 lists all of the quantifiers.

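Table 14.2. Quantifiers

Quantifier   Meaning
?            0 or 1
*            0 or more
+            1 or more
{x}          Exactly x occurrences
{x,}         At least x occurrences
{x,y}        Between x and y occurrences (inclusive)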

To match a certain quantity of a thing, put the quantity between curly braces ( {} ), stating a specific number, just a minimum, or both a minimum and a maximum. Thus, a{3} will match aaa; a{3,} will match aaa, aaaa, etc. (three or more a’s); and a{3,5} will match just aaa, aaaa, and aaaaa (between three and five).

Note that a quantifier applies to the thing that came before it, so a? matches zero or one a’s, ab? matches an a followed by zero or one b’s, but (ab)? matches zero or one ab’s. Therefore, to match color or colour, you could also use colou?r as the pattern.
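Here is a brief sketch of the quantifiers in use. The subjects are arbitrary test strings, and the caret and dollar sign are added to some patterns so that the quantity applies to the entire value.

```php
<?php
// Between three and five a's, and nothing else:
$four = preg_match('/^a{3,5}$/', 'aaaa');
$two  = preg_match('/^a{3,5}$/', 'aa');
$six  = preg_match('/^a{3,5}$/', 'aaaaaa');

// The question mark applies only to the u that precedes it:
$american = preg_match('/colou?r/', 'color');
$british  = preg_match('/colou?r/', 'colour');
$typo     = preg_match('/colou?r/', 'colouur');

// Grouping makes the quantifier apply to ab as a unit:
$oneAb  = preg_match('/^(ab)?$/', 'ab');
$twoAbs = preg_match('/^(ab)?$/', 'abab');

var_dump($four, $two, $six, $american, $british, $typo, $oneAb, $twoAbs);
```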

To use quantifiers

1. Load pcre.php in your Web browser, if it is not already.

2. Check if a string contains the letters c and t, with one or more letters in between.

Figure: The plus sign, when used as a quantifier, requires that one or more of a thing be present.

To do so, use c.+t as the pattern and any number of strings as the subject. Remember that the period matches any character (except for the newline). Each of the following would be a match: cat, count, coefficient, etc. The word doctor would not match, as there are no letters between the c and the t (although doctor would match c.*t).

3. Check if a string matches either cat or cats.

Figure: You can check for the plural form of many words by adding s? to the pattern.

To start, if you want to make an exact match, use both the caret and the dollar sign. Then you’d have the literal text cat, followed by an s, followed by a question mark (representing 0 or 1 s’s). The final pattern—^cats?$—matches cat or cats but not my cat left or I like cats.

4. Check if a string ends with .33, .333, or .3333.

Figure: The curly braces let you dictate the acceptable range of quantities present.

To find a period, escape it with a backslash: \.. To find a three, use a literal 3. To find a range of 3’s, use the curly brackets ({}). Putting this together, the pattern is \.3{2,4}. Because the string should end with this (nothing else can follow), conclude the pattern with a dollar sign: \.3{2,4}$.

Admittedly, this is kind of a stupid example (not sure when you’d need to do exactly this), but it does demonstrate several things. This pattern will match lots of things—12.333, varmit.3333, .33, look .33—but not 12.3 or 12.334.

5. Match a five-digit number.

Figure: The proper test for confirming that a number contains five digits.

A number can be any one of the numbers 0 through 9, so the heart of the pattern is (0|1|2|3|4|5|6|7|8|9). Plainly said, this means: a number is a 0 or a 1 or a 2 or a 3.... To make it a five-digit number, follow this with a quantifier: (0|1|2|3|4|5|6|7|8|9){5}. Finally, to match this exactly (as opposed to matching a five-digit number within a string), use the caret and the dollar sign: ^(0|1|2|3|4|5|6|7|8|9){5}$.

This, of course, is one way to match a United States zip code, a very useful pattern.


Tip

When using curly braces to specify a number of characters, you must always include the minimum number. The maximum is optional: a{3} and a{3,} are acceptable, but a{,3} is not.



Tip

Although it demonstrates good dedication to programming to learn how to write and execute your own regular expressions, numerous working examples are available already by searching the Internet.


Using Character Classes

As the last example demonstrated (the five-digit number in the previous section), relying solely upon literals in a pattern can be tiresome. Having to write out all those digits to match any number is silly. Imagine if you wanted to match any four-letter word: ^(a|b|c|d...){4}$ (and that doesn’t even take into account uppercase letters)! To make these common references easier, you can use character classes.

Classes are created by placing characters within square brackets ( [ ] ). For example, you can match any one vowel with [aeiou]. This is equivalent to (a|e|i|o|u). Or you can use the hyphen to indicate a range of characters: [a-z] is any single lowercase letter and [A-Z] is any uppercase, [A-Za-z] is any letter in general, and [0-9] matches any digit. As an example, [a-z]{3} would match abc, def, oiw, etc.

Within classes, most of the meta-characters are treated literally, except for four. The backslash is still the escape, but the caret (^) is a negation operator when used as the first character in the class. So [^aeiou] will match any non-vowel. The only other meta-character within a class is the dash, which indicates a range. (If the dash is used as the last character in a class, it’s a literal dash.) And, of course, the closing bracket (]) still has meaning as the terminator of the class.

Naturally, a class can have both ranges and literal characters. A person’s first name, which can contain letters, spaces, apostrophes, and periods, could be represented by [A-Za-z '.] (again, the period doesn’t need to be escaped within the class, as it loses its meta-meaning there).

Along with creating your own classes, there are six already-defined classes that have their own shortcuts (Table 14.3). The digit and space classes are easy to understand. The word character class doesn’t mean “word” in the language sense but rather as in a string unbroken by spaces or punctuation.

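Table 14.3. Character Classes

Shortcut   Meaning
\d         Any digit
\D         Any character that is not a digit
\s         Any white space character
\S         Any character that is not white space
\w         Any word character (letters, numbers, and the underscore)
\W         Any character that is not a word character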

Using this information, the five-digit number (aka, zip code) pattern could more easily be written as ^[0-9]{5}$ or ^\d{5}$. As another example, can\s?not will match both can not and cannot (the word can, followed by zero or one space characters, followed by not).
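To see the classes and shortcuts at work, consider this small sketch. The subjects are made-up values, and the results are stored in variables whose names are chosen only for this illustration.

```php
<?php
// A five-digit number, written with the digit shortcut:
$zipOk  = preg_match('/^\d{5}$/', '12345');
$zipBad = preg_match('/^\d{5}$/', '1234a');

// Zero or one white space characters between can and not:
$twoWords  = preg_match('/can\s?not/', 'I can not go.');
$oneWord   = preg_match('/can\s?not/', 'I cannot go.');
$twoSpaces = preg_match('/can\s?not/', 'I can  not go.');

// A custom class and its negation:
$vowel    = preg_match('/^[aeiou]$/', 'e');
$nonVowel = preg_match('/^[^aeiou]$/', 'e');

var_dump($zipOk, $zipBad, $twoWords, $oneWord, $twoSpaces, $vowel, $nonVowel);
```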

To use character classes

1. Load pcre.php in your Web browser, if it is not already.

2. Check if a string is formatted as a valid United States zip code.

Figure: The pattern to match a United States zip code, in either the five-digit or five plus four format.

A United States zip code always starts with five digits (^\d{5}). But a valid zip code could also have a dash followed by another four digits (-\d{4}$). To make this last part optional, use the question mark (the 0 or 1 quantifier). This complete pattern is then ^(\d{5})(-\d{4})?$. To make it all clearer, the first part of the pattern (matching the five digits) is also grouped in parentheses, although this isn’t required in this case.

3. Check if a string contains no spaces.

Figure: The no-white-space shortcut can be used to ensure that a submitted string is contiguous.

The \S character class shortcut will match non-space characters. To make sure that the entire string contains no spaces, apply the one-or-more quantifier and wrap the pattern in the caret and the dollar sign: ^\S+$. If you don’t use the caret and the dollar sign, then all the pattern is confirming is that the subject contains at least one non-space character.

4. Validate an email address.

Figure: A pretty good and reliable validation for email addresses.

The pattern ^[\w.-]+@[\w.-]+\.[A-Za-z]{2,6}$ provides for reasonably good email validation. It’s wrapped in the caret and the dollar sign, so the string must be a valid email address and nothing more. An email address starts with letters, numbers, and the underscore (represented by \w), plus a period (.) and a dash. This first block will match larryullman, larry77, larry.ullman, larry-ullman, and so on. Next, all email addresses include one and only one @. After that, there can be any number of letters, numbers, periods, and dashes. This is the domain name: larryullman, smith-jones, amazon.co (as in amazon.co.uk), etc. Finally, all email addresses conclude with one period and between two and six letters. This accounts for .com, .edu, .info, .travel, etc.


Tip

I think that the zip code example is a great demonstration as to how complex and useful regular expressions are. One pattern accurately tests for both formats of the zip code, which is fantastic. But when you put this into your PHP code, with quotes and delimiters, it’s not easily understood:

if (preg_match ('/^(\d{5})(-\d{4})?$/', $zip)) {

That certainly looks like gibberish, right?



Tip

This email address validation pattern is pretty good, although not perfect. It will allow some invalid addresses to pass through (like ones starting with a period or containing multiple periods together). However, a 100 percent foolproof validation pattern is ridiculously long, and frequently using regular expressions is really a matter of trying to exclude the bulk of invalid entries without inadvertently excluding any valid ones.
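As a rough sketch of how this pattern behaves, including the imperfection just described, the following runs it against a handful of addresses. The addresses are invented for illustration and use the example.com and example.co.uk domains.

```php
<?php
// The email pattern from this section, with delimiters and quotes added.
$pattern = '/^[\w.-]+@[\w.-]+\.[A-Za-z]{2,6}$/';

$plain      = preg_match($pattern, 'larry.ullman@example.com');
$noTld      = preg_match($pattern, 'larry@example');            // No period and letters at the end.
$withSpace  = preg_match($pattern, 'larry ullman@example.com'); // A space is not in the class.
$leadingDot = preg_match($pattern, '.larry@example.co.uk');     // Invalid, but it passes anyway.

var_dump($plain, $noTld, $withSpace, $leadingDot);
```

The last test shows the kind of invalid address that slips through, since a period is an allowed character anywhere in the first block.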



Tip

Regular expressions, particularly PCRE ones, can be extremely complex. When starting out, it’s just as likely that your use of them will break the validation routines instead of improving them. That’s why practicing like this is important.


Finding All Matches

Going back to the PHP functions used with Perl-Compatible regular expressions, preg_match( ) has been used just to see if a pattern matches a value or not. But the script hasn’t been reporting what, exactly, in the value did match the pattern. You can find out this information by providing a variable as a third argument to the function:

preg_match(pattern, subject, $match);

The $match variable will contain the first match found (because this function only returns the first match in a value). To find every match, use preg_match_all( ). Its syntax is the same:

preg_match_all(pattern, subject, $matches);

This function will return the number of matches made (0 if none were found), or FALSE if an error occurred. It will also assign to $matches every match made. Let’s update the PHP script to print the returned matches, and then run a couple more tests.
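The difference between the two functions can be sketched outside of the form. The subject here is a sample sentence, and the variable names are arbitrary.

```php
<?php
$subject = 'This is a formulaic test for informal matches.';

// preg_match() stops at the first match and stores it in $match:
$found = preg_match('/for/', $subject, $match);

// preg_match_all() keeps going and stores every match in $matches:
$count = preg_match_all('/for/', $subject, $matches);

echo "preg_match() returned $found\n";
print_r($match);
echo "preg_match_all() returned $count\n";
print_r($matches);
```

Note that $match is a flat array, while $matches is an array whose first element is itself an array of every match.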

To report all matches

1. Open pcre.php (Script 14.1) in your text editor or IDE, if it is not already.

2. Change the invocation of preg_match( ) to (Script 14.2):

if (preg_match_all ($pattern, $subject, $matches) ) {

There are two changes here. First, the actual function being called is different. Second, the third argument is provided a variable name that will be assigned every match.

Script 14.2. To reveal exactly what values in a string match which patterns, this revised version of the script will print out each match. You can retrieve the matches by naming a variable as the third argument in preg_match( ) or preg_match_all( ).


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
   <title>Testing PCRE</title>
</head>
<body>
<?php // Script 14.2 - matches.php
// This script takes a submitted string and checks it against a submitted pattern.
// This version prints every match made.

if ($_SERVER['REQUEST_METHOD'] == 'POST') {

   // Trim the strings:
   $pattern = trim($_POST['pattern']);
   $subject = trim($_POST['subject']);

   // Print a caption:
   echo "<p>The result of checking<br /><b>$pattern</b><br />against<br />$subject<br />is ";

   // Test:
   if (preg_match_all ($pattern, $subject, $matches) ) {
      echo 'TRUE!</p>';

      // Print the matches:
      echo '<pre>' . print_r($matches, 1) . '</pre>';

   } else {
      echo 'FALSE!</p>';
   }

} // End of submission IF.
// Display the HTML form.
?>
<form action="matches.php" method="post">
   <p>Regular Expression Pattern: <input type="text" name="pattern" value="<?php if (isset($pattern)) echo htmlentities($pattern); ?>" size="40" /> (include the delimiters)</p>
   <p>Test Subject: <textarea name="subject" rows="5" cols="40"><?php if (isset($subject)) echo htmlentities($subject); ?></textarea></p>
   <input type="submit" name="submit" value="Test!" />
</form>
</body>
</html>


3. After printing the value TRUE, print the contents of $matches:

echo '<pre>' . print_r($matches, 1) . '</pre>';

Using print_r( ) to output the contents of the variable is the easiest way to know what’s in $matches (you could use a foreach loop instead). As you’ll see when you run this script, this variable will be an array whose first element is an array of matches made.

4. Change the form’s action attribute to matches.php:

<form action="matches.php" method="post">

This script will be renamed, so the action attribute must be changed, too.

5. Change the subject input to be a textarea:

<p>Test Subject: <textarea name="subject" rows="5" cols="40"> <?php if (isset($subject)) echo htmlentities($subject); ?></textarea></p>

In order to be able to enter in more text for the subject, this element will become a textarea.

6. Save the file as matches.php, place it in your Web directory, and test it in your Web browser.

For the first test, use for as the pattern and This is a formulaic test for informal matches. as the subject. It may not be proper English, but it’s a good test subject.

Figure: This first test returns three matches, as the literal text for was found three times.

For the second test, change the pattern to for.*. The result may surprise you, the cause of which is discussed in the sidebar, “Being Less Greedy.” To make this search less greedy, the pattern could be changed to for.*?, whose results would be the same as those of the first test.

Figure: Because regular expressions are “greedy” by default (see the sidebar), this pattern only finds one match in the string. That match happens to start with the first instance of for and continues until the end of the string.

For the third test, use for[\S]*, or, more simply for\S*. This has the effect of making the match stop as soon as a white space character is found (because the pattern wants to match for followed by any number of non–white space characters).

Figure: This revised pattern matches strings that begin with for and end on a word.

For the final test, use \b[a-z]*for[a-z]*\b as the pattern. This pattern makes use of boundaries, discussed in the sidebar “Using Boundaries,” earlier in the chapter.

Figure: Unlike the pattern in the third test, this one matches entire words that contain for (informal here, formal in the third test).
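The four tests can also be run directly in code, which makes the greedy and lazy behaviors easy to compare side by side. This is a sketch using the same subject as the steps; it assumes the word boundary shortcut \b on both ends of the final pattern.

```php
<?php
$subject = 'This is a formulaic test for informal matches.';

// Greedy: the first for, then everything through the end of the string.
preg_match_all('/for.*/', $subject, $greedy);

// Lazy: .*? matches as little as possible, so only for itself is captured.
preg_match_all('/for.*?/', $subject, $lazy);

// for followed by any number of non-white space characters.
preg_match_all('/for\S*/', $subject, $nonSpace);

// Whole lowercase words that contain for.
preg_match_all('/\b[a-z]*for[a-z]*\b/', $subject, $words);

print_r($greedy[0]);
print_r($lazy[0]);
print_r($nonSpace[0]);
print_r($words[0]);
```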


Tip

The preg_split( ) function will take a string and break it into an array using a regular expression pattern.


Using Modifiers

The majority of the special characters you can use in regular expression patterns are introduced in this chapter. One final type of special character is the pattern modifier. Table 14.4 lists these. Pattern modifiers are different than the other meta-characters in that they are placed after the closing delimiter.

Table 14.4. Pattern Modifiers


Of these modifiers, the most important is i, which enables case-insensitive searches. All of the examples using variations on for (in the previous sequence of steps) would not match the word For. However, /for.*/i would be a match. Note that I am including the delimiters in that pattern, as the modifier goes after the closing one. Similarly, the last step in the previous sequence referenced the sidebar “Being Less Greedy” and stated how for.*? would perform a lazy search. So would /for.*/U.

The multiline mode is interesting in that you can make the caret and the dollar sign behave differently. By default, each applies to the entire value. In multiline mode, the caret matches the beginning of any line and the dollar sign matches the end of any line.
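The three modifiers discussed here (i, U, and m) can be compared with their unmodified counterparts in a short sketch. The subjects are arbitrary, and the multiline subject uses \n within double quotes to create three lines.

```php
<?php
// i: case-insensitive matching.
$sensitive   = preg_match('/for.*/', 'For example');
$insensitive = preg_match('/for.*/i', 'For example');

// U: quantifiers become lazy instead of greedy.
preg_match('/for.*/', 'formulaic test', $greedy);
preg_match('/for.*/U', 'formulaic test', $lazy);

// m: the caret and dollar sign apply to each line, not the whole value.
$list = "alpha\nbeta\ngamma";
$singleLine = preg_match_all('/^\w+$/', $list, $none);
$multiLine  = preg_match_all('/^\w+$/m', $list, $lines);

var_dump($sensitive, $insensitive, $greedy[0], $lazy[0], $singleLine, $multiLine);
```

Without the m modifier, the last pattern fails outright, because the value as a whole contains newlines, which \w does not match.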

To use modifiers

1. Load matches.php in your Web browser, if it is not already.

2. Validate a list of email addresses.

Figure: A list of email addresses, one per line, can be validated using the multiline mode. Each valid address is stored in $matches.

To do so, use /^[\w.-]+@[\w.-]+\.[A-Za-z]{2,6}\r?$/m as the pattern. You’ll see that I’ve added an optional carriage return (\r?) before the dollar sign. This is necessary because some of the lines will contain returns and others won’t. And in multiline mode, the dollar sign matches the end of a line. (To be more flexible, you could use \s? instead.)

3. Validate a list of United States zip codes.

Figure: Validating a list of zip codes, one per line.

Very similar to the example in Step 2, the pattern is now /^(\d{5})(-\d{4})?\s?$/m. You’ll see that I’m using the more flexible \s? instead of \r?.

You’ll also notice when you try this yourself (or in the figure) that the $matches variable contains a lot more information now. This will be explained in the next section of the chapter.


Tip

To always match the start or end of a value, regardless of the multiline setting, there are shortcuts you can use. Within the pattern, the shortcut \A will match only the very beginning of the value, \z matches the very end, and \Z matches the very end or just before a final newline, like $ in single-line mode.



Tip

If your version of PHP supports it, it’s probably best to use the Filter extension to validate an email address or a URL. But if you have to validate a list of either, the Filter extension won’t cut it, and regular expressions will be required.


Matching and Replacing Patterns

The last subject to discuss in this chapter is how to match and replace patterns in a value. While preg_match( ) and preg_match_all( ) will find things for you, if you want to do a search and replace, you’ll need to use preg_replace( ). Its syntax is

preg_replace(pattern, replacement, subject);

This function takes an optional fourth argument limiting the number of replacements made.

To replace all instances of cat with dog, you would use

$str = preg_replace('/cat/', 'dog', 'I like my cat.');

This function returns the altered value (or the unaltered value if no matches were made), so you’ll likely want to assign it to a variable or use it as an argument to another function (like printing it by calling echo). Also, as a reminder, the above is just an example: you’d never want to replace one literal string with another using regular expressions; use str_replace( ) instead.
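The following sketch puts those points together: the returned value, the optional fourth argument, and the str_replace( ) alternative. The sample sentence and the variable names are invented for the illustration.

```php
<?php
// Hypothetical subject with three occurrences of "cat".
$subject = 'My cat chased another cat and a third cat.';

// With no limit, every match is replaced.
$all = preg_replace('/cat/', 'dog', $subject);

// The optional fourth argument caps the number of replacements.
$first_only = preg_replace('/cat/', 'dog', $subject, 1);

// If nothing matches, the subject comes back unaltered.
$unchanged = preg_replace('/bird/', 'dog', $subject);

// For a literal string such as this, str_replace() is the better tool.
$literal = str_replace('cat', 'dog', $subject);

echo "$all\n$first_only\n$unchanged\n$literal\n";
```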

There is a related concept to discuss that is involved with this function: back referencing. In a zip code matching pattern, ^(\d{5})(-\d{4})?$, there are two groups within parentheses: the first five digits and the optional dash plus four-digit extension. Within a regular expression pattern, PHP will automatically number parenthetical groupings beginning at 1. Back referencing allows you to refer to each individual section by using $ plus the corresponding number. For example, if you match the zip code 94710-0001 with this pattern, referring back to $2 will give you -0001. The code $0 refers to the entire match. This is why Image in the previous section shows entire zip code matches in $matches[0], the matching first five digits in $matches[1], and any matching dash plus four digits in $matches[2].
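Here is a small sketch of back referencing with that zip code pattern. The replacement strings are hypothetical, made up only to show where $0, $1, and $2 end up.

```php
<?php
$pattern = '/^(\d{5})(-\d{4})?$/';

// preg_match() stores the same numbered groups in $matches.
preg_match($pattern, '94710-0001', $matches);
print_r($matches);

// Single quotation marks keep PHP from treating $1 and $2 specially.
$labeled = preg_replace($pattern, 'Base: $1, extension: $2', '94710-0001');

// $0 stands for the entire match, with or without parentheses in the pattern.
$bracketed = preg_replace($pattern, '[$0]', '94710');

echo "$labeled\n$bracketed\n";
```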

To practice with this, let’s modify Script 14.2 to also take a replacement input Image.

Image

Image One use of preg_replace( ) would be to replace variations on inappropriate words with symbols representing their omission.

To match and replace patterns

1. Open matches.php (Script 14.2) in your text editor or IDE, if it is not already.

2. Add a reference to a third incoming variable (Script 14.3):

$replace = trim($_POST['replace']);

As you can see in Image, the third form input (added between the existing two) takes the replacement value. That value is also trimmed to get rid of any extraneous spaces.

If your server has Magic Quotes enabled, you’ll again need to apply stripslashes( ) here.

Script 14.3. To test the preg_replace( ) function, which replaces a matched pattern in a string with another value, you can use this third version of the PCRE test script


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
   <title>Testing PCRE Replace</title>
</head>
<body>
<?php // Script 14.3 - replace.php
// This script takes a submitted string and checks it against a submitted pattern.
// This version replaces one value with another.

if ($_SERVER['REQUEST_METHOD'] == 'POST') {

   // Trim the strings:
   $pattern = trim($_POST['pattern']);
   $subject = trim($_POST['subject']);
   $replace = trim($_POST['replace']);

   // Print a caption:
   echo "<p>The result of replacing<br /><b>$pattern</b><br />with<br />$replace<br />in<br />$subject<br /><br />";

   // Check for a match:
   if (preg_match ($pattern, $subject) ) {
      echo preg_replace($pattern, $replace, $subject) . '</p>';
   } else {
      echo 'The pattern was not found!</p>';
   }

} // End of submission IF.
// Display the HTML form.
?>
<form action="replace.php" method="post">
   <p>Regular Expression Pattern: <input type="text" name="pattern" value="<?php if (isset($pattern)) echo htmlentities($pattern); ?>" size="40" /> (include the delimiters)</p>
   <p>Replacement: <input type="text" name="replace" value="<?php if (isset($replace)) echo htmlentities($replace); ?>" size="40" /></p>
   <p>Test Subject: <textarea name="subject" rows="5" cols="40"><?php if (isset($subject)) echo htmlentities($subject); ?></textarea></p>
   <input type="submit" name="submit" value="Test!" />
</form>
</body>
</html>


3. Change the caption:

echo "<p>The result of replacing<br /><b>$pattern</b><br /> with<br />$replace<br />in<br />$subject<br /><br />";

The caption will print out all of the incoming values, prior to applying preg_replace( ).

4. Change the regular expression conditional so that it only calls preg_replace( ) if a match is made:

if (preg_match ($pattern, $subject) ) {
  echo preg_replace($pattern, $replace, $subject) . '</p>';
} else {
  echo 'The pattern was not found!</p>';
}

You can call preg_replace( ) without running preg_match( ) first. If no match was made, then no replacement will occur. But to make it clear when a match is or is not being made (which is always good to confirm, considering how tricky regular expressions are), the preg_match( ) function will be applied first. If it returns a TRUE value, then preg_replace( ) is called, printing the results Image. Otherwise, a message is printed indicating that no match was made Image.

Image

Image The resulting text has uses of bleep, bleeps, bleeped, bleeper, and bleeping replaced with *****.

Image

Image If the pattern is not found within the subject, the subject will not be changed.

5. Change the form’s action attribute to replace.php:

<form action="replace.php" method="post">

This file will be renamed, so this value needs to be changed accordingly.

6. Add a text input for the replacement string:

<p>Replacement: <input type="text" name="replace" value="<?php if (isset($replace)) echo htmlentities($replace); ?>" size="40" /></p>

7. Save the file as replace.php, place it in your Web directory, and test it in your Web browser Image.

Image

Image Another use of preg_replace( ) is dynamically turning email addresses into clickable links. See the HTML source code for the full effect of the replacement.

As a good example, you can turn an email address found within some text into its HTML link equivalent: <a href="mailto:email@example.com">email@example.com</a>. The pattern for matching an email address should be familiar by now: ^[\w.-]+@[\w.-]+\.[A-Za-z]{2,6}$. However, because the email address could be found within some text, the caret and dollar sign need to be replaced by the word boundaries shortcut: \b. The final pattern is therefore /\b[\w.-]+@[\w.-]+\.[A-Za-z]{2,6}\b/.

To refer to this matched email address, you can refer to $0 (because $0 refers to the entire match, whether or not parentheses are used). So the replacement value would be <a href="mailto:$0">$0</a>. Because HTML is involved here, look at the HTML source code of the resulting page for the best idea of what happened.
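If you’d like to experiment without the form, the script’s core logic can be sketched as a small function. The replace_or_report( ) name is hypothetical, as are the sample sentences, and the bleep pattern is only my guess at one that would catch the variations mentioned earlier.

```php
<?php
// Hypothetical helper mirroring the script's conditional:
// replace only when the pattern matches; otherwise report the failure.
function replace_or_report($pattern, $replace, $subject) {
    if (preg_match($pattern, $subject)) {
        return preg_replace($pattern, $replace, $subject);
    }
    return 'The pattern was not found!';
}

// Turn an email address within some text into a clickable link.
$linked = replace_or_report(
    '/\b[\w.-]+@[\w.-]+\.[A-Za-z]{2,6}\b/',
    '<a href="mailto:$0">$0</a>',
    'Write to me@example.com today.'
);

// Replace variations on a word, case-insensitively (an assumed pattern).
$censored = replace_or_report(
    '/\bbleep(s|ed|er|ing)?\b/i',
    '*****',
    'Bleep that bleeping bleeper.'
);

// No match, so the message is returned instead.
$missing = replace_or_report('/dog/', 'bird', 'I like my cat.');

echo "$linked\n$censored\n$missing\n";
```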


Tip

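Back references can even be used within the pattern. For example, you might do this if a pattern included a grouping (i.e., a subpattern) that would be repeated. Within a pattern, a back reference is written with a backslash plus the group number, as in \1.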
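As one possible illustration, this sketch uses a back reference within the pattern to find an accidentally doubled word. The sample sentence is made up, and the pattern is only one way of approaching the problem.

```php
<?php
// \1 matches whatever text the first group (\w+) just captured,
// so this pattern finds a word immediately repeated after a space.
$pattern = '/\b(\w+) \1\b/';
$subject = 'This is is a test.';

preg_match($pattern, $subject, $matches);
print_r($matches);

// Replacing the pair with $1 keeps a single copy of the word.
$fixed = preg_replace($pattern, '$1', $subject);
echo "$fixed\n";
```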



Tip

I’ve introduced, somewhat quickly, the bulk of the PCRE syntax here, but there’s much more to it. Once you’ve mastered all this, you can consider moving on to anchors, named subpatterns, comments, lookarounds, possessive quantifiers, and more.


Review and Pursue

If you have any problems with the review questions or the pursue prompts, turn to the book’s supporting forum (www.LarryUllman.com/forums/).

Review

• What function is used to match a regular expression? What function is used to find all matches of a regular expression? What function is used to replace matches of a regular expression?

• What characters can you use and not use to delineate a regular expression?

• How do you match a literal character or string of characters?

• What are meta-characters? How do you escape a meta-character?

• What meta-character do you use to bind a pattern to the beginning of a string? To the end?

• How do you create subpatterns (aka groupings)?

• What are the quantifiers? How do you require 0 or 1 of a character or string? 0 or more? 1 or more? Precisely X occurrences? A range of occurrences? A minimum of occurrences?

• What are character classes?

• What meta-characters still have meaning within character classes?

• What shortcut represents the “any digit” character class? The “any white space” class? “Any word”? What shortcuts represent the opposite of these?

• What are boundaries? How do you create boundaries in patterns?

• How do you make matches “lazy”? And what does that mean anyway?

• What are the pattern modifiers?

• What is back referencing? How does it work?

Pursue

• Search online for a PCRE “cheat sheet” (PHP or otherwise) that lists all the meaningful characters and classes. Print out the cheat sheet and keep it beside your computer.

• Practice, practice, practice!
