Programmers build applications that are based on established rules regarding the classification, parsing, storage, and display of information, whether that information consists of gourmet recipes, store sales receipts, poetry, or some other collection of data. This chapter introduces many of the PHP functions that you'll undoubtedly use on a regular basis when performing such tasks.
This chapter covers the following topics:
Validate_US package: In this and subsequent chapters, various PEAR packages are introduced that are relevant to the respective chapter's subject matter. This chapter introduces Validate_US, a PEAR package that is useful for validating the syntax of items commonly used in applications of all types, including phone numbers, Social Security numbers (SSNs), ZIP codes, and state abbreviations. (If you're not familiar with PEAR, it's introduced in Chapter 11.)
Regular expressions provide the foundation for describing or matching data according to defined syntax rules. A regular expression is nothing more than a pattern of characters itself, matched against a certain parcel of text. This sequence may be a pattern with which you are already familiar, such as the word dog, or it may be a pattern with specific meaning in the context of the world of pattern matching, <(?)>.*< /.?>, for example.
PHP is bundled with function libraries supporting both the POSIX and Perl regular expression implementations. Each has its own unique style of syntax and is discussed accordingly in later sections. Keep in mind that innumerable tutorials have been written regarding this matter; you can find information on the Web and in various books. Therefore, this chapter provides just a basic introduction to each, leaving it to you to search out further information.
If you are not already familiar with the mechanics of regular expressions, please take some time to read through the short tutorial that makes up the remainder of this section. If you are already a regular expression pro, feel free to skip past the tutorial to the section "PHP's Regular Expression Functions (POSIX Extended)."
The structure of a POSIX regular expression is similar to that of a typical arithmetic expression: various elements (operators) are combined to form a more complex expression. The meaning of the combined regular expression elements is what makes them so powerful. You can locate not only literal expressions, such as a specific word or number, but also a multitude of semantically different but syntactically similar strings, such as all HTML tags in a file.
Note: POSIX stands for Portable Operating System Interface for Unix, and is representative of a set of standards originally intended for Unix-based operating systems. POSIX regular expression syntax is an attempt to standardize how regular expressions are implemented in many programming languages.
The simplest regular expression is one that matches a single character, such as g, which would match strings such as gog, haggle, and bag. You could combine several letters together to form larger expressions, such as gan, which logically would match any string containing gan: gang, organize, or Reagan, for example.
You can also test for several different expressions simultaneously by using the pipe (|) character. For example, you could test for php or zend via the regular expression php|zend.
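The literal and alternation patterns just described can be tried out with a short sketch. It assumes PHP 7 or later and uses preg_match() from the PCRE extension rather than the POSIX functions; for patterns this simple the two syntaxes should behave the same. The helper name contains_match() is hypothetical and exists only for this illustration.

```php
<?php
// Hypothetical helper: true when $pattern matches anywhere in $subject.
function contains_match(string $pattern, string $subject): bool
{
    // '~' is an arbitrary delimiter required by the preg functions.
    return preg_match('~' . $pattern . '~', $subject) === 1;
}

// The literal pattern "gan" matches any string that contains g, a, n in sequence.
foreach (['gang', 'organize', 'Reagan', 'dog'] as $word) {
    printf("%-8s contains 'gan': %s\n", $word, contains_match('gan', $word) ? 'yes' : 'no');
}

// The pipe tests for either alternative.
var_dump(contains_match('php|zend', 'powered by the zend engine'));
var_dump(contains_match('php|zend', 'written in perl'));
```

Only dog fails the first test, and only the second alternation test returns false.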
Before getting into PHP's POSIX-based regular expression functions, let's review three methods that POSIX supports for locating different character sequences: brackets, quantifiers, and predefined character ranges.
Brackets
Brackets ([]) are used to represent a list, or range, of characters to be matched. For instance, contrary to the regular expression php, which will locate strings containing the explicit string php, the regular expression [php] will find any string containing the character p or h. Several commonly used character ranges follow:
[0-9] matches any decimal digit from 0 through 9.
[a-z] matches any character from lowercase a through lowercase z.
[A-Z] matches any character from uppercase A through uppercase Z.
[A-Za-z] matches any letter, whether uppercase A through Z or lowercase a through z.
Of course, the ranges shown here are general; you could also use the range [0-3] to match any decimal digit ranging from 0 through 3, or the range [b-v] to match any lowercase character ranging from b through v. In short, you can specify any ASCII range you wish.
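A minimal sketch of the difference between a literal pattern and a bracketed list or range follows. It assumes PHP 7 or later and uses preg_match() from the PCRE extension as a stand-in for the POSIX functions; the variable names and sample strings are made up for illustration.

```php
<?php
// [php] is a character class: it matches one 'p' or 'h', not the word "php".
$classHit   = preg_match('/[php]/', 'help'); // "help" contains both h and p
$literalHit = preg_match('/php/', 'help');   // the sequence "php" is absent

// [0-3] accepts only the digits 0 through 3.
$lowDigit  = preg_match('/[0-3]/', 'room 2');
$highDigit = preg_match('/[0-3]/', 'room 7');

// [b-v] accepts only the lowercase letters b through v.
$inside  = preg_match('/[b-v]/', 'm');
$outside = preg_match('/[b-v]/', 'a');

// preg_match() returns 1 for a match and 0 for no match.
var_dump($classHit, $literalHit, $lowDigit, $highDigit, $inside, $outside);
```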
Quantifiers
Sometimes you might want to create regular expressions that look for characters based on their frequency or position. For example, you might want to look for strings containing one or more instances of the letter p, strings containing at least two p's, or even strings with the letter p as their beginning or terminating character. You can make these demands by inserting special characters into the regular expression. Here are several examples of these characters:
p+ matches any string containing at least one p.
p* matches any string containing zero or more p's.
p? matches any string containing zero or one p.
p{2} matches any string containing a sequence of two p's.
p{2,3} matches any string containing a sequence of two or three p's.
p{2,} matches any string containing a sequence of at least two p's.
p$ matches any string with p at the end of it.
Still other flags can be inserted before and within a character sequence:
^p matches any string with p at the beginning of it.
[^a-zA-Z] matches any string containing at least one character that falls outside the ranges a through z and A through Z.
p.p matches any string containing p, followed by any character, in turn followed by another p.
You can also combine special characters to form more complex expressions. Consider the following examples:
^.{2}$ matches any string containing exactly two characters.
<b>(.*)</b> matches any string enclosed within <b> and </b>.
p(hp)* matches any string containing a p followed by zero or more instances of the sequence hp.
You may wish to search for these special characters in strings instead of using them in the special context just described. To do so, the characters must be escaped with a backslash (\). For example, if you want to search for a dollar amount, a plausible regular expression would be as follows: ([\$])([0-9]+); that is, a dollar sign followed by one or more digits. Notice the backslash preceding the dollar sign. Potential matches of this regular expression include $42, $560, and $3.
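The quantifiers, anchors, and escaping rules above can be checked with a small table of cases. This is a sketch under stated assumptions: it needs PHP 7.1 or later, uses preg_match() from the PCRE extension instead of the POSIX functions, and the sample strings are invented for illustration.

```php
<?php
// Each pattern maps to [a subject that should match, a subject that should not].
$cases = [
    'p+'         => ['apple', 'banana'],
    'p{2}'       => ['apple', 'cap'],
    'p$'         => ['cap', 'cape'],
    '^p'         => ['pear', 'apple'],
    '[^a-zA-Z]'  => ['abc1', 'abc'],   // needs at least one non-letter
    'p.p'        => ['pop', 'pp'],
    '^.{2}$'     => ['ok', 'okay'],
    '[\$][0-9]+' => ['$42', '42'],     // escaped dollar sign, then digits
];

$results = [];
foreach ($cases as $pattern => [$good, $bad]) {
    $regex = '~' . $pattern . '~'; // '~' is an arbitrary delimiter
    $results[$pattern] = [preg_match($regex, $good), preg_match($regex, $bad)];
    printf("%-11s %-6s => %d   %-6s => %d\n",
        $pattern, $good, $results[$pattern][0], $bad, $results[$pattern][1]);
}
```

Every row should report 1 for the first subject and 0 for the second.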
Predefined Character Ranges (Character Classes)
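For reasons of convenience, several predefined character ranges, also known as character classes, are available. Character classes specify an entire range of characters, such as the alphabet or an integer set. Standard classes include the following:
[:alpha:]: Lowercase and uppercase alphabetical characters. This can also be specified as [A-Za-z].
[:alnum:]: Lowercase and uppercase alphabetical characters and numerical digits. This can also be specified as [A-Za-z0-9].
[:cntrl:]: Control characters such as tab, escape, or backspace.
[:digit:]: Numerical digits 0 through 9. This can also be specified as [0-9].
[:graph:]: Printable characters found in the range of ASCII 33 to 126.
[:lower:]: Lowercase alphabetical characters. This can also be specified as [a-z].
[:punct:]: Punctuation characters, including ~ ` ! @ # $ % ^ & * ( ) - _ + = { } [ ] : ; ' < > , . ? and /.
[:upper:]: Uppercase alphabetical characters. This can also be specified as [A-Z].
[:space:]: Whitespace characters, including the space, horizontal tab, vertical tab, new line, form feed, or carriage return.
[:xdigit:]: Hexadecimal characters. This can also be specified as [a-fA-F0-9].
PHP offers seven functions for searching strings using POSIX-style regular expressions: ereg(), ereg_replace(), eregi(), eregi_replace(), split(), spliti(), and sql_regcase(). These functions are discussed in this section. Note that they were deprecated in PHP 5.3.0 and removed in PHP 7.0.0, so the examples here run only on older PHP versions.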
Performing a Case-Sensitive Search
The ereg() function executes a case-sensitive search of a string for a defined pattern, returning a true value if the pattern is found (the length of the matched string when regs is supplied, and 1 otherwise), and FALSE otherwise. Its prototype follows:
int ereg(string pattern, string string [, array regs])
Here's how you could use ereg() to ensure that a username consists solely of lowercase letters:
<?php
    $username = "jasoN";
    // [^a-z] matches any character that is not a lowercase letter
    if (ereg("([^a-z])", $username))
        echo "Username must be all lowercase!";
    else
        echo "Username is all lowercase!";
?>
In this case, ereg() will return TRUE, causing the error message to be output.
The optional input parameter regs contains an array of all matched expressions that are grouped by parentheses in the regular expression. Making use of this array, you could segment a URL into several pieces, as shown here:
<?php
    $url = "http://www.apress.com";
    // Break $url down into three distinct pieces:
    // "http://www", "apress", and "com"
    // The dots are escaped so that they match a literal period only
    $parts = ereg("^(http://www)\.([[:alnum:]]+)\.([[:alnum:]]+)", $url, $regs);
    echo $regs[0]; // outputs the entire string "http://www.apress.com"
    echo "<br />";
    echo $regs[1]; // outputs "http://www"
    echo "<br />";
    echo $regs[2]; // outputs "apress"
    echo "<br />";
    echo $regs[3]; // outputs "com"
?>
This returns the following:
http://www.apress.com
http://www
apress
com
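On PHP 7 and later, where ereg() is unavailable, the same segmentation could be sketched with preg_match() from the PCRE extension. This is an assumed substitution rather than the function this section describes; the main visible differences are the pattern delimiters and the variable name $found, which is introduced here for illustration.

```php
<?php
$url = "http://www.apress.com";

// '#' is used as the delimiter so the slashes in "http://" need no escaping.
// The POSIX class [[:alnum:]] is also understood by PCRE.
$found = preg_match('#^(http://www)\.([[:alnum:]]+)\.([[:alnum:]]+)#', $url, $regs);

if ($found === 1) {
    echo $regs[0], "\n"; // the entire match
    echo $regs[1], "\n"; // first parenthesized group
    echo $regs[2], "\n"; // second parenthesized group
    echo $regs[3], "\n"; // third parenthesized group
}
```

As with ereg(), index 0 of the array holds the whole match and the parenthesized groups follow in order.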
Performing a Case-Insensitive Search
The eregi() function searches a string for a defined pattern in a case-insensitive fashion. Its prototype follows:
int eregi(string pattern, string string [, array regs])
This function can be useful when checking the validity of strings, such as passwords. This concept is illustrated in the following example:
<?php
    $pswd = "jasonasdf";
    // Accept only 8 to 10 alphanumeric characters
    if (!eregi("^[a-zA-Z0-9]{8,10}$", $pswd))
        echo "Invalid password!";
    else
        echo "Valid password!";
?>
In this example, the user must provide an alphanumeric password consisting of eight to ten characters, or else an error message is displayed.
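A rough equivalent for PHP 7 and later can be written with preg_match() and the i (case-insensitive) pattern modifier. This is a sketch, not the function covered in this section: the helper name is_valid_password() is hypothetical, and the D modifier is an added assumption that stops $ from matching before a trailing newline.

```php
<?php
// Hypothetical helper: true when $pswd is 8 to 10 alphanumeric characters.
function is_valid_password(string $pswd): bool
{
    // i: ignore case, so [a-z] also covers A-Z.
    // D: make $ match only at the very end of the string.
    return preg_match('/^[a-z0-9]{8,10}$/iD', $pswd) === 1;
}

foreach (['jasonasdf', 'JASON123', 'short', 'bad pass!'] as $candidate) {
    echo $candidate, ': ', is_valid_password($candidate) ? 'Valid' : 'Invalid', " password!\n";
}
```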
Replacing Text in a Case-Sensitive Fashion
The ereg_replace() function operates much like ereg(), except that its power is extended to finding and replacing a pattern with a replacement string instead of simply locating it. Its prototype follows:
string ereg_replace(string pattern, string replacement, string string)
If no matches are found, the string will remain unchanged. Like ereg(), ereg_replace() is case sensitive. Consider an example:
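<?php
    $text = "This is a link to http://www.wjgilmore.com/";
    // \0 in the replacement refers to the entire matched text; the
    // replacement is single-quoted so PHP leaves the backslashes intact
    echo ereg_replace("http://([a-zA-Z0-9./-]+)$",
                      '<a href="\0">\0</a>',
                      $text);
?>
This returns the following:
This is a link to <a href="http://www.wjgilmore.com/">http://www.wjgilmore.com/</a>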
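For PHP 7 and later, a comparable replacement could be sketched with preg_replace() from the PCRE extension. This is an assumed substitution, not the function described above; the variable $linked is introduced for illustration, and the sample text deliberately ends with the URL so that the $ anchor applies.

```php
<?php
$text = "This is a link to http://www.wjgilmore.com/";

// '#' delimits the pattern; $0 in the replacement is the entire matched text.
// Single quotes keep PHP from treating $0 as a variable.
$linked = preg_replace('#http://[a-zA-Z0-9./-]+$#', '<a href="$0">$0</a>', $text);

echo $linked, "\n";
```

If the pattern does not match, preg_replace() returns the subject string unchanged, mirroring the behavior noted for ereg_replace().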
A rather interesting feature of PHP's string-replacement capability is the ability to back-reference parenthesized substrings. This works much like the optional input parameter regs in the function ereg(), except that the substrings are referenced using backslashes, such as