Frank M. Kromann, Beginning PHP and MySQL, https://doi.org/10.1007/978-1-4302-6044-8_9

9. Strings and Regular Expressions

Frank M. Kromann, Aliso Viejo, CA, USA

Programmers build applications based on established rules regarding the classification, parsing, storage, and display of information, whether that information consists of gourmet recipes, store sales receipts, poetry, or anything else. This chapter introduces many of the PHP functions that you’ll undoubtedly use on a regular basis when performing such tasks.

This chapter covers the following topics:

Regular expressions: PHP supports the use of regular expressions to search strings for patterns or replace elements of a string with another value based on patterns. There are several types of regular expressions, and the one supported in PHP is called Perl-style regex, or PCRE (Perl Compatible Regular Expressions).
String manipulation: PHP is the “Swiss Army Knife” of string manipulation, allowing you to slice and dice text in nearly every conceivable fashion. Offering nearly 100 native string manipulation functions, and the ability to chain functions together to produce even more sophisticated behaviors, you’ll run out of programming ideas before exhausting PHP’s capabilities in this regard. In this chapter, I’ll introduce you to several of the most commonly used manipulation functions that PHP has to offer.

Regular Expressions

Regular expressions provide the foundation for describing or matching data according to defined syntax rules. A regular expression is nothing more than a pattern of characters itself, matched against a certain parcel of text. This sequence may be a pattern with which you are already familiar, such as the word dog, or it may be a pattern with a specific meaning in the context of the world of pattern matching, <(?)>.*< /.?>, for example.

If you are not already familiar with the mechanics of regular expressions, please take some time to read through the short tutorial that makes up the remainder of this section. However, because innumerable online and print tutorials have been written regarding this matter, I’ll focus on providing you with just a basic introduction to the topic. If you are already well-acquainted with regular expression syntax, feel free to skip past the tutorial to the “PHP’s Regular Expression Functions (Perl Compatible)” section.

Regular Expression Syntax (Perl)

Perl has long been considered one of the most powerful parsing languages ever written. It provides a comprehensive regular expression language that can be used to search, modify, and replace even the most complicated of string patterns. The developers of PHP felt that instead of reinventing the regular expression wheel, so to speak, they should make the famed Perl regular expression syntax available to PHP users.

Perl’s regular expression syntax is actually a derivation of the POSIX implementation, resulting in considerable similarities between the two. The remainder of this section is devoted to a brief introduction of Perl regular expression syntax. Let’s start with a simple example of a Perl-based regular expression:

/food/

Notice that the string food is enclosed between two forward slashes, also called delimiters. In addition to slashes (/), it is also possible to use a hash sign (#), plus (+), percentage (%), and others. The character used as the delimiter must be escaped with a backslash (\) if it’s used in the pattern. Choosing a different delimiter can remove the need for escaping. If you are matching a URL pattern that includes many slashes, it might be more convenient to use a hash sign as the delimiter, as the following two equivalent patterns show:

/http:\/\/somedomain.com\//

#http://somedomain.com/#

Instead of matching exact words, it’s possible to use quantifiers to match multiple words:

/fo+/

The use of the + quantifier indicates that any string that contains an f followed by one or more o’s will match the pattern. Some potential matches include food, fool, and fo4. Alternatively, the * quantifier is used to match 0 or more of the preceding character. As an example,

/fo*/

will match any section of the string with an f followed by 0 or more o’s. This will match the food, fool, and fo4 from the previous example but also fast and fine, etc. Neither of these quantifiers places an upper limit on the number of repetitions of a character. Such limits can be added as shown in the next example:

/fo{2,4}/

This matches f followed by two to four occurrences of o. Some potential matches include fool, fooool, and foosball.

The three examples above define a pattern starting with an f, followed by 1 or more o, 0 or more o’s, or between 2 and 4 o’s. Any character before or after the pattern is not part of the match.
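As a rough illustration of the point that only part of a string belongs to the match, the following sketch runs the three patterns against a few sample words (the word list and variable names are my own, not part of the chapter) and records the matched portion:

```php
<?php
// Hypothetical sample words; each pattern is tried against every word.
$words = ["food", "fast", "fooool", "fo4"];
$patterns = ["/fo+/", "/fo*/", "/fo{2,4}/"];

$results = [];
foreach ($patterns as $pattern) {
    foreach ($words as $word) {
        // $m[0] holds the matched portion only, not the whole word.
        $results[$pattern][$word] = preg_match($pattern, $word, $m) ? $m[0] : null;
    }
}

print_r($results);
```

Here /fo*/ matches just the f in fast, while /fo{2,4}/ rejects fo4 because only one o follows the f.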

Modifiers

Often you’ll want to tweak the interpretation of a regular expression; for example, you may want to tell the regular expression to execute a case-insensitive search or to ignore comments embedded within its syntax. These tweaks are known as modifiers, and they go a long way toward helping you to write short and concise expressions. A few of the more interesting modifiers are outlined in Table 9-1. A full list of valid modifiers with detailed descriptions can be found here: http://php.net/manual/en/reference.pcre.pattern.modifiers.php .

Table 9-1

Five Sample Modifiers

Modifier	Description
i	Perform a case-insensitive search.
m	Treat a string as several (m for multiple) lines. By default, the ^ and $ characters match at the very start and very end of the string in question. Using the m modifier allows ^ and $ to match at the beginning and end of any line in the string.
s	Treat a string as a single line: the dot (.) metacharacter also matches newline characters.
x	Ignore whitespace and comments within the regular expression, unless the whitespace is escaped or within a character class.
U	Invert the greediness of quantifiers. Many quantifiers are “greedy”; they match as many characters as possible rather than stopping at the first point where the pattern is satisfied. You can cause them to be “ungreedy” with this modifier.

These modifiers are placed directly after the regular expression, for instance, /string/i. Let’s consider an example:

/wmd/i: Matches WMD, wMD, WMd, wmd, and any other case variation of the string wmd.

Other languages support a global modifier (g). In PHP, however, this behavior is provided by two separate functions: preg_match() stops at the first match, whereas preg_match_all() finds every match.
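The effect of a few modifiers can be sketched as follows. The sample strings and variable names are assumptions made for illustration; only the modifiers themselves come from the discussion above:

```php
<?php
// i: case-insensitive matching.
$caseInsensitive = preg_match('/wmd/i', "WmD");

// m: ^ and $ match at line boundaries, so both lines are found.
$text = "first line\nsecond line";
$lineCount = preg_match_all('/^\w+ line$/m', $text, $lines);

// x: whitespace and # comments inside the pattern are ignored.
$pattern = '/
    \d{4}   # year
    -
    \d{2}   # month
/x';
$hasDate = preg_match($pattern, "Released 2018-05");

// There is no g modifier: preg_match() stops at the first match,
// while preg_match_all() counts every match.
$first = preg_match('/o/', "foo boo");
$all = preg_match_all('/o/', "foo boo");

printf("%d %d %d %d %d\n", $caseInsensitive, $lineCount, $hasDate, $first, $all);
```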

Metacharacters

Perl regular expressions also employ metacharacters to further filter their searches. A metacharacter is simply a character or character sequence that symbolizes special meaning. A list of useful metacharacters follows:

\A: Matches only at the beginning of the string.
\b: Matches a word boundary.
\B: Matches anything but a word boundary.
\d: Matches a digit character. This is the same as [0-9].
\D: Matches a nondigit character.
\s: Matches a whitespace character.
\S: Matches a nonwhitespace character.
[]: Encloses a character class.
(): Encloses a character grouping or defines a back reference or the start and end of a subpattern.
$: Matches the end of the string or the end of every line in multiline mode.
^: Matches the beginning of the string or beginning of every line in multiline mode.
.: Matches any character except for the newline.
\: Quotes the next metacharacter.
\w: Matches any single underscore or alphanumeric character. This depends on the locale. For U.S. English this is the same as [a-zA-Z0-9_].
\W: Matches any single character that is not an underscore or alphanumeric character.

Let’s consider a few examples. The first regular expression will match strings such as pisa and lisa but not sand:

/sa\b/

The next matches the first case-insensitive occurrence of the word linux:

/\blinux\b/i

The opposite of the word boundary metacharacter is \B, matching on anything but a word boundary. Therefore, this example will match strings such as sand and Sally but not Melissa:

/sa\B/i

The final example returns all instances of strings matching a dollar sign followed by one or more digits:

/\$\d+/
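These metacharacter examples can be tried out with a short sketch like the one below. It leans on preg_grep() and preg_match_all(), which are introduced later in the chapter, and the price string is a made-up sample:

```php
<?php
// \b: "sa" must be followed by a word boundary.
$endsWord = array_values(preg_grep('/sa\b/', ['pisa', 'lisa', 'sand']));

// \B: "sa" must NOT be followed by a word boundary.
$midWord = array_values(preg_grep('/sa\B/i', ['sand', 'Sally', 'Melissa']));

// \$ is a literal dollar sign; \d+ is one or more digits.
preg_match_all('/\$\d+/', 'Lunch was $12, dinner $45.', $prices);

print_r($endsWord);
print_r($midWord);
print_r($prices[0]);
```

Single-quoted PHP strings are used for the patterns so that the backslashes and dollar signs reach the regular expression engine untouched.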

PHP’s Regular Expression Functions (Perl Compatible)

PHP offers nine functions for searching and modifying strings using Perl-compatible regular expressions: preg_filter(), preg_grep(), preg_match(), preg_match_all(), preg_quote(), preg_replace(), preg_replace_callback(), preg_replace_callback_array(), and preg_split(). In addition to these, the preg_last_error() function provides a way to get the error code for the last execution. These functions are introduced in the following sections.

Searching for a Pattern

The preg_match() function searches a string for a specific pattern, returning 1 if it is found and 0 otherwise (or FALSE if an error occurs). Its prototype follows:

int preg_match(string pattern, string string [, array matches [, int flags [, int offset]]])

The optional input parameter matches is passed by reference and will contain various sections of the subpatterns contained in the search pattern, if applicable. Here’s an example that uses preg_match() to perform a case-insensitive search:

<?php

$line = "vim is the greatest word processor ever created! Oh vim, how I love thee!";

if (preg_match("/\bVim\b/i", $line, $match)) print "Match found!";

This script will confirm a match if the word Vim or vim is located, but not simplevim, vims, or evim.

You can use the optional flags parameter to modify the behavior of the returned matches parameter. Passing PREG_OFFSET_CAPTURE changes how the array is populated: every matched string is returned together with its offset within the searched string.

Finally, the optional offset parameter will adjust the search starting point within the string to a specified position.
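A minimal sketch of the flags and offset parameters, reusing the $line string from the earlier example (the other variable names are mine), might look like this:

```php
<?php
$line = "vim is the greatest word processor ever created! Oh vim, how I love thee!";

// PREG_OFFSET_CAPTURE: each entry becomes [matched text, byte offset].
preg_match('/vim/i', $line, $match, PREG_OFFSET_CAPTURE);
$firstOffset = $match[0][1];

// Start the search just past the first match to locate the second one.
preg_match('/vim/i', $line, $match, PREG_OFFSET_CAPTURE, $firstOffset + 1);
$secondOffset = $match[0][1];

printf("First at %d, second at %d\n", $firstOffset, $secondOffset);
```

Note that the reported offsets are always relative to the start of the string, even when a nonzero offset is supplied.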

Matching All Occurrences of a Pattern

The preg_match_all() function matches all occurrences of a pattern in a string, assigning each occurrence to an array in the order you specify via an optional input parameter. Its prototype follows:

int preg_match_all(string pattern, string string, array matches [, int flags [, int offset]])

The flags parameter accepts one of three values:

PREG_PATTERN_ORDER is the default if the optional flags parameter is not defined. PREG_PATTERN_ORDER specifies the order in the way that you might think most logical: $pattern_array[0] is an array of all complete pattern matches, $pattern_array[1] is an array of all strings matching the first parenthesized regular expression, and so on.
PREG_SET_ORDER orders the array a bit differently than the default setting. $pattern_array[0] contains the first set of matches (the complete match followed by the text matched by each parenthesized regular expression), $pattern_array[1] contains the second set of matches, and so on.
PREG_OFFSET_CAPTURE modifies the behavior of the returned matches parameter, changing how the array is populated by instead returning every matched string and its corresponding offset as determined by the location of the match.

Here’s how you would use preg_match_all() to find all strings enclosed in bold HTML tags:

<?php

$userinfo = "Name: <b>Zeev Suraski</b> <br> Title: <b>PHP Guru</b>";

preg_match_all("/<b>(.*)<\/b>/U", $userinfo, $pat_array);

printf("%s <br /> %s", $pat_array[0][0], $pat_array[0][1]);

Rendered in a browser, this displays the following:

Zeev Suraski

PHP Guru
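To make the difference between the two orderings concrete, here is a small sketch that runs the same pattern under both flags. The input string and variable names are invented for the illustration:

```php
<?php
$html = "<b>one</b> and <b>two</b>";
$pattern = '/<b>(.*)<\/b>/U';

// PREG_PATTERN_ORDER (the default):
// [0] = every complete match, [1] = every first-group capture.
preg_match_all($pattern, $html, $byPattern, PREG_PATTERN_ORDER);

// PREG_SET_ORDER:
// [0] = complete match plus captures for the first hit, [1] = for the second hit.
preg_match_all($pattern, $html, $bySet, PREG_SET_ORDER);

print_r($byPattern);
print_r($bySet);
```

PREG_SET_ORDER tends to be the more convenient choice when you loop over the hits one at a time, because each iteration receives everything belonging to a single match.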

Searching an Array

The preg_grep() function searches all elements of an array, returning an array consisting of all elements matching a certain pattern. Its prototype follows:

array preg_grep(string pattern, array input [, int flags])

Consider an example that uses this function to search an array for foods beginning with p:

<?php

$foods = array("pasta", "steak", "fish", "potatoes");

$food = preg_grep("/^p/", $foods);

print_r($food);

This returns the following:

Array ( [0] => pasta [3] => potatoes )

Note that the keys of the returned array correspond to the index positions of the input array: an element that matches keeps its original key, and nonmatching elements are simply omitted. If you want consecutive keys starting at 0, filter the output array through the function array_values(), introduced in Chapter 5.

The optional input parameter flags accepts one value, PREG_GREP_INVERT. Passing this flag will result in retrieval of those array elements that do not match the pattern.
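Both points, reindexing and inverting, can be sketched with the same $foods array (the result variable names are my own):

```php
<?php
$foods = array("pasta", "steak", "fish", "potatoes");

// array_values() renumbers the surviving elements from 0.
$startsWithP = array_values(preg_grep("/^p/", $foods));

// PREG_GREP_INVERT keeps the elements that do NOT match; keys are preserved.
$others = preg_grep("/^p/", $foods, PREG_GREP_INVERT);

print_r($startsWithP);
print_r($others);
```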

Delimiting Special Regular Expression Characters

The function preg_quote() inserts a backslash before every character of special significance to regular expression syntax. These special characters include . \ + * ? [ ^ ] $ ( ) { } = ! < > | : and -. Its prototype follows:

string preg_quote(string str [, string delimiter])

The optional parameter delimiter specifies what delimiter is used for the regular expression, causing it to also be escaped by a backslash. Consider an example:

<?php

$text = "Tickets for the fight are going for $500.";

echo preg_quote($text);

This returns the following:

Tickets for the fight are going for \$500\.
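A common use of preg_quote() is building a pattern from text you do not control. The following sketch assumes a hypothetical search term stored in $term and passes the delimiter as the second argument so that it would be escaped as well:

```php
<?php
$text = 'Tickets for the fight are going for $500.';

// Hypothetical search term containing characters that are special in a regex.
$term = '$500.';

// Escape the term, including any occurrence of the / delimiter.
$pattern = '/' . preg_quote($term, '/') . '/';
$found = preg_match($pattern, $text);

// The delimiter argument matters when the term itself contains a slash.
$escapedPath = preg_quote('a/b', '/');

echo $pattern, "\n", $escapedPath, "\n";
```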

Replacing All Occurrences of a Pattern

The preg_replace() function replaces all occurrences of pattern with replacement and returns the modified result. Its prototype follows:

mixed preg_replace(mixed pattern, mixed replacement, mixed str [, int limit [, int count]])

Note that both the pattern and replacement parameters are defined as mixed. This is because you can supply a string or an array for either. The optional input parameter limit specifies how many matches should take place. Failing to set limit or setting it to -1 will result in the replacement of all occurrences (unlimited). Finally, the optional count parameter, passed by reference, will be set to the total number of replacements made. Consider an example:

<?php

$text = "This is a link to http://www.wjgilmore.com/.";

echo preg_replace("/http:\/\/(.*)\//", "<a href=\"\${0}\">\${0}</a>", $text);

This returns the following:

This is a link to <a href="http://www.wjgilmore.com/">http://www.wjgilmore.com/</a>.

If you pass arrays as the pattern and replacement parameters, the function will cycle through each element of each array, making replacements as they are found. Consider this example, which could be marketed as a corporate report filter:

<?php

$draft = "In 2010 the company faced plummeting revenues and scandal.";

$keywords = array("/faced/", "/plummeting/", "/scandal/");

$replacements = array("celebrated", "skyrocketing", "expansion");

echo preg_replace($keywords, $replacements, $draft);

This returns the following:

In 2010 the company celebrated skyrocketing revenues and expansion.

The preg_filter() function operates in a fashion identical to preg_replace(), except that it returns only the subjects in which a match was found (with the replacements applied), rather than every subject.
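The limit and count parameters, and the way preg_filter() differs from preg_replace(), can be sketched as follows. The sample strings are invented for the illustration:

```php
<?php
$draft = "scandal after scandal after scandal";

// limit = 2: only the first two occurrences are replaced;
// $count is filled in with the number of replacements made.
$limited = preg_replace('/scandal/', 'expansion', $draft, 2, $count);

// With an array of subjects, preg_replace() returns every subject,
// whereas preg_filter() returns only those in which a match occurred.
$subjects = ['revenue scandal', 'quiet quarter'];
$replaced = preg_replace('/scandal/', 'expansion', $subjects);
$filtered = preg_filter('/scandal/', 'expansion', $subjects);

echo $limited, " ($count replacements)\n";
print_r($replaced);
print_r($filtered);
```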

Creating a Custom Replacement Function

In some situations you might wish to replace strings based on a somewhat more complex set of criteria beyond what is provided by PHP’s default capabilities. For instance, consider a situation where you want to scan some text for acronyms such as IRS and insert the complete name directly following the acronym. To do so, you need to create a custom function and then use the function preg_replace_callback() to temporarily tie it into the language. Its prototype follows:

mixed preg_replace_callback(mixed pattern, callback callback, mixed str [, int limit [, int count]])

The pattern parameter determines what you’re looking for and the str parameter defines the string you’re searching. The callback parameter defines the name of the function to be used for the replacement task. The optional parameter limit specifies how many matches should take place. Failing to set limit or setting it to -1 will result in the replacement of all occurrences. Finally, the optional count parameter will be set to the number of replacements made. In the following example, a function named acronym() is passed into preg_replace_callback() and is used to insert the long form of various acronyms into the target string:

<?php

// This function replaces each acronym found in $matches with its
// long form, followed by the acronym itself in parentheses
function acronym($matches) {
    $acronyms = array(
        'WWW' => 'World Wide Web',
        'IRS' => 'Internal Revenue Service',
        'PDF' => 'Portable Document Format');

    if (isset($acronyms[$matches[1]]))
        return $acronyms[$matches[1]] . " (" . $matches[1] . ")";
    else
        return $matches[1];
}

// The target text
$text = "The <acronym>IRS</acronym> offers tax forms in
<acronym>PDF</acronym> format on the <acronym>WWW</acronym>.";

// Add the acronyms' long forms to the target text
$newtext = preg_replace_callback("/<acronym>(.*)<\/acronym>/U", 'acronym',
    $text);

print_r($newtext);

This returns the following:

The Internal Revenue Service (IRS) offers tax forms in
Portable Document Format (PDF) format on the World Wide Web (WWW).

PHP 7.0 introduced a variant of preg_replace_callback() called preg_replace_callback_array(). These functions work in similar ways, except the new function combines pattern and callback into an array of patterns and callbacks. This makes it possible to do multiple substitutions with a single function call.

Also note that with the introduction of anonymous functions, also called closures (see Chapter 4), it’s no longer necessary to provide the callback parameter as a string with the name of a function. It can be written as an anonymous function. The preceding example would look like this:

<?php

// The target text
$text = "The <acronym>IRS</acronym> offers tax forms in
<acronym>PDF</acronym> format on the <acronym>WWW</acronym>.";

// Add the acronyms' long forms to the target text
$newtext = preg_replace_callback("/<acronym>(.*)<\/acronym>/U",
    function($matches) {
        $acronyms = array(
            'WWW' => 'World Wide Web',
            'IRS' => 'Internal Revenue Service',
            'PDF' => 'Portable Document Format');

        if (isset($acronyms[$matches[1]]))
            return $acronyms[$matches[1]] . " (" . $matches[1] . ")";
        else
            return $matches[1];
    },
    $text);

print_r($newtext);
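For comparison, a call to preg_replace_callback_array() might be sketched like this. The two patterns, the callbacks, and the sample sentence are assumptions chosen to show several substitutions handled by one call:

```php
<?php
$text = "The <acronym>IRS</acronym> site lists 3 forms.";

// Each key is a pattern and each value is the callback used for it.
// The patterns are applied in the order in which they are listed.
$newtext = preg_replace_callback_array(
    [
        // Drop the acronym tags, keeping only their content.
        '/<acronym>(.*)<\/acronym>/U' => function ($matches) {
            return $matches[1];
        },
        // Double every number found in the text.
        '/\d+/' => function ($matches) {
            return (string) ($matches[0] * 2);
        },
    ],
    $text
);

echo $newtext;
```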

Splitting a String into Various Elements Based on a Case-Insensitive Pattern

The preg_split() function operates exactly like explode(), except that the pattern can also be defined in terms of a regular expression. Its prototype follows:

array preg_split(string pattern, string string [, int limit [, int flags]])

If the optional input parameter limit is specified, only that limit number of substrings is returned. Consider an example:

<?php

$delimitedText = "Jason+++Gilmore+++++++++++Columbus+++OH";

$fields = preg_split("/\++/", $delimitedText);

foreach($fields as $field) echo $field."\n";

This returns the following:

Jason
Gilmore
Columbus
OH
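The limit and flags parameters can be sketched with the same input string. PREG_SPLIT_NO_EMPTY is one of the flags that preg_split() accepts; the variable names are my own:

```php
<?php
$delimitedText = "Jason+++Gilmore+++++++++++Columbus+++OH";

// limit = 2: the second element holds the unsplit remainder.
$firstAndRest = preg_split('/\++/', $delimitedText, 2);

// Splitting on a single + produces empty strings between consecutive pluses;
// PREG_SPLIT_NO_EMPTY drops them (-1 means "no limit").
$fields = preg_split('/\+/', $delimitedText, -1, PREG_SPLIT_NO_EMPTY);

print_r($firstAndRest);
print_r($fields);
```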

Note

Later in this chapter, the “Alternatives for Regular Expression Functions” section offers several standard functions that can be used in lieu of regular expressions for certain tasks. In many cases, these alternative functions actually perform much faster than their regular expression counterparts.

Other String-Specific Functions

In addition to the regular expression-based functions discussed in the first half of this chapter, PHP offers approximately 100 functions collectively capable of manipulating practically every imaginable aspect of a string. To introduce each function would be out of the scope of this book and would only repeat much of the information in the PHP documentation. This section is devoted to a categorical FAQ of sorts, focusing upon the string-related issues that seem to most frequently appear within community forums. The section is divided into the following topics:

Determining string length
Comparing two strings
Manipulating string case
Converting strings to and from HTML
Alternatives for regular expression functions
Padding and stripping a string
Counting characters and words

Note

The functions described in this section assume that the strings are composed of single-byte characters. That means the number of characters in a string is equal to the number of bytes. Some character sets use multiple bytes to represent each character. The standard PHP functions will often fail to provide the correct values when used on multibyte strings. There is an extension available called mbstring that can be used to manipulate multibyte strings.
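The difference can be seen in a short sketch. It assumes UTF-8 text and checks for the mbstring extension before calling mb_strlen(), since that extension is not always installed:

```php
<?php
// "\u{00E8}" is the letter è, which occupies two bytes in UTF-8.
$word = "caf\u{00E8}";

// strlen() counts bytes, not characters.
$bytes = strlen($word);

// mb_strlen() is available only when the mbstring extension is loaded.
$chars = function_exists('mb_strlen') ? mb_strlen($word, 'UTF-8') : null;

printf("%d bytes, %s characters\n", $bytes, $chars ?? 'unknown');
```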

Determining the Length of a String

Determining string length is a repeated action within countless applications. The PHP function strlen() accomplishes this task quite nicely. This function returns the length of a string, where each character in the string is equivalent to one unit (byte). Its prototype follows:

int strlen(string str)

The following example verifies whether a user password is of acceptable length:

<?php

$pswd = "secretpswd";

if (strlen($pswd) < 10)

echo "Password is too short!";

else

echo "Password is valid!";

In this case, the error message will not appear because the chosen password consists of 10 characters, whereas the conditional expression validates whether the target string consists of less than 10 characters.

Comparing Two Strings

String comparison is arguably one of the most important features of the string-handling capabilities of any language. Although there are many ways in which two strings can be compared for equality, PHP provides four functions for performing this task: strcmp(), strcasecmp(), strspn(), and strcspn().

Comparing Two Strings’ Case Sensitively

The strcmp() function performs a case-sensitive comparison of two strings. Its prototype follows:

int strcmp(string str1, string str2)

It will return one of three possible values based on the comparison outcome:

0 if str1 and str2 are equal
A value less than 0 if str1 is less than str2 (exactly -1 as of PHP 8.2)
A value greater than 0 if str2 is less than str1 (exactly 1 as of PHP 8.2)

Websites often require a registering user to enter and then confirm a password, lessening the possibility of an incorrectly entered password as a result of a typing error. strcmp() is a great function for comparing the two password entries because passwords are usually treated in a case-sensitive fashion:

<?php

$pswd = "supersecret";

$pswd2 = "supersecret2";

if (strcmp($pswd, $pswd2) != 0) {

echo "Passwords do not match!";

} else {

echo "Passwords match!";

}

Note that the strings must match exactly for strcmp() to consider them equal. For example, Supersecret is different from supersecret. If you’re looking to compare two strings’ case insensitively, consider strcasecmp() , introduced next.

Another common point of confusion regarding this function surrounds its behavior of returning 0 if the two strings are equal. This is different from executing a string comparison using the == operator, like so:

if ($str1 == $str2)

While both accomplish the same goal, which is to compare two strings, keep in mind that the values they return in doing so are different.
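The following sketch contrasts the return values. The numeric-string comparison at the end is an extra illustration of my own, showing one case where == and strcmp() disagree:

```php
<?php
// strcmp() returns 0 for equal strings and a nonzero value otherwise.
$equal = strcmp("supersecret", "supersecret");
$different = strcmp("Supersecret", "supersecret");

// Because 0 is falsy, a bare if (strcmp(...)) is true when the strings DIFFER,
// so compare the result against 0 explicitly.
$matches = (strcmp("supersecret", "supersecret") === 0);

// == compares numeric strings as numbers; strcmp() compares them byte by byte.
$looseEqual = ("1e3" == "1000");
$strictEqual = (strcmp("1e3", "1000") === 0);

var_dump($equal, $different !== 0, $matches, $looseEqual, $strictEqual);
```

When two strings must be identical byte for byte, either strcmp() compared against 0 or the === operator avoids the numeric conversion shown in the last two lines.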

Comparing Two Strings’ Case Insensitively

The strcasecmp() function operates exactly like strcmp(), except that its comparison is case insensitive. Its prototype follows:

int strcasecmp(string str1, string str2)

The following example compares two e-mail addresses, an ideal use for strcasecmp() because case does not determine an e-mail address’s uniqueness:

<?php

$email1 = "[email protected]";

$email2 = "[email protected]";

if (! strcasecmp($email1, $email2))

echo "The email addresses are identical!";

In this example, the message is output because strcasecmp() performs a case-insensitive comparison of $email1 and $email2 and determines that they are indeed identical.

Calculating the Similarity Between Two Strings

The strspn() function returns the length of the first segment in a string containing characters also found in another string. Its prototype follows:

int strspn(string str1, string str2 [, int start [, int length]])

Here’s how you might use strspn() to ensure that a password does not consist solely of numbers:

<?php

$password = "3312345";

if (strspn($password, "1234567890") == strlen($password))

echo "The password cannot consist solely of numbers!";

In this case, the error message is returned because $password does indeed consist solely of digits.

You can use the optional start parameter to define a starting position within the string other than the default 0 offset. The optional length parameter can be used to define the length of str1 string that will be used in the comparison.
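As a rough sketch of the start and length parameters, assume a hypothetical password made up of a two-letter prefix followed by digits:

```php
<?php
$password = "ab3312345";
$digits = "1234567890";

// From offset 0 the first character is not a digit, so the span is 0.
$fromStart = strspn($password, $digits);

// Skip the two-letter prefix: the remaining seven characters are all digits.
$afterPrefix = strspn($password, $digits, 2);

// Examine only three characters, starting at offset 2.
$limited = strspn($password, $digits, 2, 3);

printf("%d %d %d\n", $fromStart, $afterPrefix, $limited);
```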

Calculating the Difference Between Two Strings

The strcspn() function returns the length of the first segment of a string containing characters not found in another string. The optional start and length parameters behave in the same fashion as those used in the previously introduced strspn() function. Its prototype follows:

int strcspn(string str1, string str2 [, int start [, int length]])

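Here’s an example of password validation using strcspn():

<?php

$password = "a12345";

if (strcspn($password, "1234567890") == 0) {
    echo "Password cannot begin with a number!";
}

In this case, the error message will not be displayed because $password begins with a letter. Note that strcspn() returns 0 whenever the very first character is a digit, so this test rejects any password that starts with a number, not only passwords consisting solely of numbers.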

Manipulating String Case

Five functions are available to aid you in manipulating the case of characters in a string: strtolower(), strtoupper(), ucfirst(), lcfirst(), and ucwords().

Converting a String to All Lowercase

The strtolower() function converts a string to all lowercase letters, returning the modified string. Nonalphabetical characters are not affected. Its prototype follows:

string strtolower(string str)

The following example uses strtolower() to convert a URL to all lowercase letters:

<?php

$url = "http://WWW.EXAMPLE.COM/";

echo strtolower($url);

This returns the following:

http://www.example.com/

Converting a String to All Uppercase

Just as you can convert a string to lowercase, you can convert it to uppercase. This is accomplished with the function strtoupper() . Its prototype follows:

string strtoupper(string str)

Nonalphabetical characters are not affected. This example uses strtoupper() to convert a string to all uppercase letters:

<?php

$msg = "I annoy people by capitalizing e-mail text.";

echo strtoupper($msg);

This returns the following:

I ANNOY PEOPLE BY CAPITALIZING E-MAIL TEXT.

Capitalizing the First Letter of a String

The ucfirst() function capitalizes the first letter of the string str, if it is alphabetical. Its prototype follows:

string ucfirst(string str)

Nonalphabetical characters will not be affected. Additionally, any capitalized characters found in the string will be left untouched. Consider this example:

<?php

$sentence = "the newest version of PHP was released today!";

echo ucfirst($sentence);

This returns the following:

The newest version of PHP was released today!

Note that while the first letter is indeed capitalized, the capitalized word PHP was left untouched. The function lcfirst() performs the opposite action of turning the first character of a string to lowercase.

Capitalizing Each Word in a String

The ucwords() function capitalizes the first letter of each word in a string. Its prototype follows:

string ucwords(string str)

Nonalphabetical characters are not affected. This example uses ucwords() to capitalize each word in a string:

<?php

$title = "O'Malley wins the heavyweight championship!";

echo ucwords($title);

This returns the following:

O'Malley Wins The Heavyweight Championship!

Note that if O’Malley was accidentally written as O’malley, ucwords() would not catch the error, as it considers a word to be defined as a string of characters separated from other entities in the string by a blank space on each side.
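If you do want the letter after an apostrophe capitalized, ucwords() accepts an optional second argument listing the characters it should treat as word separators. A minimal sketch, using a lowercased variant of the title above:

```php
<?php
$title = "o'malley wins the heavyweight championship!";

// Default behavior: only whitespace separates words.
$default = ucwords($title);

// Treat both the space and the apostrophe as word separators.
$withApostrophe = ucwords($title, " '");

echo $default, "\n", $withApostrophe, "\n";
```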

Converting Strings to and from HTML

Converting a string or an entire file into a form suitable for viewing on the Web (and vice versa) is easier than you would think, and it comes with some security risks. If the input string is provided by a user who is browsing the website, it could be possible to inject script code that will be executed by the browser as it now looks like that code came from the server. Do not trust the input from users. The following functions are suited for such tasks.

Converting Newline Characters to HTML Break Tags

The nl2br() function inserts the XHTML-compliant line break tag, <br />, before each newline (\n) character in a string. Its prototype follows:

string nl2br(string str)

The newline characters could be created via a carriage return, or explicitly written into the string. The following example translates a text string to HTML format:

<?php

$recipe = "3 tablespoons Dijon mustard
1/3 cup Caesar salad dressing
8 ounces grilled chicken breast
3 cups romaine lettuce";

// convert the newlines to <br />'s.
echo nl2br($recipe);

Executing this example results in the following output:

3 tablespoons Dijon mustard<br />
1/3 cup Caesar salad dressing<br />
8 ounces grilled chicken breast<br />
3 cups romaine lettuce

Converting Special Characters to Their HTML Equivalents

During the general course of communication, you may come across many characters that are not included in a document’s text encoding, or that are not readily available on the keyboard. Examples of such characters include the copyright symbol (©), the cent sign (¢), and the letter e with a grave accent (è). To address such shortcomings, a set of universal key codes was devised, known as character entity references. When these entities are parsed by the browser, they will be converted into their recognizable counterparts. For example, the three aforementioned characters would be written as &copy;, &cent;, and &egrave;, respectively.

To perform these conversions, you can use the htmlentities() function . Its prototype follows:

string htmlentities(string str [, int quote_style [, string charset [, boolean double_encode]]])

Because of the special nature of quote marks within markup, the optional quote_style parameter offers the opportunity to choose how they will be handled. Three values are accepted:

ENT_COMPAT: Convert double quotes and ignore single quotes. This is the default prior to PHP 8.1, which changed the default to ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401.
ENT_NOQUOTES: Ignore both double and single quotes.
ENT_QUOTES: Convert both double and single quotes.

A second optional parameter, charset , determines the character set used for the conversion. Table 9-2 offers the list of supported character sets. If charset is omitted, it will default to the default character set defined with the php.ini setting default_charset .

Table 9-2

htmlentities()’s Supported Character Sets

Character Set	Description
BIG5	Traditional Chinese
BIG5-HKSCS	BIG5 with additional Hong Kong extensions, traditional Chinese
cp866	DOS-specific Cyrillic character set
cp1251	Windows-specific Cyrillic character set
cp1252	Windows-specific character set for Western Europe
EUC-JP	Japanese
GB2312	Simplified Chinese
ISO-8859-1	Western European, Latin-1
ISO-8859-5	Little-used Cyrillic charset (Latin/Cyrillic).
ISO-8859-15	Western European, Latin-9
KOI8-R	Russian
Shift_JIS	Japanese
MacRoman	Charset that was used by Mac OS
UTF-8	ASCII-compatible multibyte 8-bit encoding

The final optional parameter, double_encode, controls whether HTML entities that already exist in the string are encoded again. It defaults to TRUE, meaning everything is converted. In most cases, you’ll probably want to set this parameter to FALSE if you suspect HTML entities already exist in the target string, which prevents htmlentities() from encoding them a second time.

The following example converts the necessary characters for web display:

<?php

$advertisement = "Coffee at 'Cafè Française' costs $2.25.";

echo htmlentities($advertisement);

This returns the following:

Coffee at 'Caf&egrave; Fran&ccedil;aise' costs $2.25.

Two characters are converted: the letter e with a grave accent (è) and the c with a cedilla (ç). The single quotes are left alone under ENT_COMPAT, the default quote_style setting before PHP 8.1.
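To see how quote_style and double_encode change the result, consider the following sketch. The input string is invented for the illustration, and the flags and charset are passed explicitly so the outcome does not depend on the PHP version’s defaults:

```php
<?php
// Sample input with single quotes, an accented letter (è),
// an existing &amp; entity, and double quotes.
$input = "'Caf\u{00E8}' &amp; \"Bar\"";

// ENT_COMPAT: double quotes are converted, single quotes are left alone.
$compat = htmlentities($input, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES: both kinds of quotes are converted.
$quotes = htmlentities($input, ENT_QUOTES, 'UTF-8');

// double_encode = false: the existing &amp; is not encoded a second time.
$noDouble = htmlentities($input, ENT_COMPAT, 'UTF-8', false);

echo $compat, "\n", $quotes, "\n", $noDouble, "\n";
```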

Using Special HTML Characters for Other Purposes

Several characters play a dual role in both markup languages and the human language. When used in the latter fashion, these characters must be converted into their displayable equivalents. For example, an ampersand must be converted to &amp; whereas a greater-than character must be converted to &gt;. The htmlspecialchars() function can do this for you, converting the following characters into their compatible equivalents. Its prototype follows:

string htmlspecialchars(string str [, int quote_style [, string charset [, boolean double_encode]]])

The optional charset and double_encode parameters operate in a fashion identical to the explanation provided in the previous section on the htmlentities() function.

The list of characters that htmlspecialchars() can convert and their resulting formats follow:

& becomes &amp;
" (double quote) becomes &quot;
' (single quote) becomes &#039; (only when quote_style includes ENT_QUOTES)
< becomes &lt;
> becomes &gt;

This function is particularly useful in preventing users from entering HTML markup into an interactive web application, such as a message board.

The following example converts potentially harmful characters using htmlspecialchars() :

<?php

$input = "I just can't get <<enough>> of PHP!";

echo htmlspecialchars($input);

Viewing the source, you’ll see the following:

I just can't get &lt;&lt;enough&gt;&gt; of PHP!

If the translation isn’t necessary, perhaps a more efficient way to do this would be to use strip_tags() , which deletes the tags from the string altogether.

Tip

If you are using htmlspecialchars() in conjunction with a function such as nl2br() , you should execute nl2br() after htmlspecialchars(); otherwise, the tags that are generated with nl2br() will be converted to visible characters.
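The ordering described in this tip can be checked with a small sketch. The comment text is made up, and ENT_QUOTES with UTF-8 is an assumed choice of arguments:

```php
<?php

// Hypothetical user comment containing markup and a line break
$comment = "I <b>love</b> PHP\nSee you soon";

// Recommended order: escape first, then add the <br /> tags
$good = nl2br(htmlspecialchars($comment, ENT_QUOTES, 'UTF-8'));

// Reversed order: the generated <br /> tags are escaped too
$bad = htmlspecialchars(nl2br($comment), ENT_QUOTES, 'UTF-8');

echo $good, "\n"; // ...PHP<br /> followed by the second line
echo $bad, "\n";  // ...PHP&lt;br /&gt; followed by the second line
```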

Converting Text into Its HTML Equivalent

Using get_html_translation_table() is a convenient way to translate text to its HTML equivalent, returning one of the two translation tables (HTML_SPECIALCHARS or HTML_ENTITIES ). Its prototype follows:

array get_html_translation_table(int table [, int quote_style])

This returned value can then be used in conjunction with another predefined function, strtr() (formally introduced later in this section), to essentially translate the text into its corresponding HTML code.

The following sample uses get_html_translation_table() to convert text to HTML:

<?php

$string = "La pasta è il piatto più amato in Italia";

$translate = get_html_translation_table(HTML_ENTITIES);

echo strtr($string, $translate);

This returns the string formatted as necessary for browser rendering:

La pasta &egrave; il piatto pi&ugrave; amato in Italia

Interestingly, array_flip() can reverse the translation table, which makes it possible to convert HTML entities back to their original characters. Assume the string to convert already contains the entities produced by the preceding code sample.

The next example uses array_flip() to return a string back to its original value:

<?php

$entities = get_html_translation_table(HTML_ENTITIES);

$translate = array_flip($entities);

$string = "La pasta &egrave; il piatto pi&ugrave; amato in Italia";

echo strtr($string, $translate);

This returns the following:

La pasta è il piatto più amato in Italia
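As an alternative to flipping the translation table by hand, PHP also provides the html_entity_decode() function. The following is a minimal sketch, assuming the string should be decoded to UTF-8:

```php
<?php

$encoded = "La pasta &egrave; il piatto pi&ugrave; amato in Italia";

// Decode named entities back to their characters (UTF-8 output assumed)
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

echo $decoded; // La pasta è il piatto più amato in Italia
```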

Creating a Customized Conversion List

The strtr() function converts all characters in a string to their corresponding match found in a predefined array. Its prototype follows:

string strtr(string str, array replacements)

This example converts the deprecated bold (<b>) tag to its XHTML equivalent (<strong>):

<?php

$table = array('<b>' => '<strong>', '</b>' => '</strong>');

$html = '<b>Today In PHP-Powered News</b>';

echo strtr($html, $table);

This returns the following:

<strong>Today In PHP-Powered News</strong>
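Two further behaviors of strtr() are worth a sketch. When given an array, it tries the longest keys first and does not translate text it has already replaced; it also accepts a three-argument form that maps individual characters. The sample strings below are made up for illustration:

```php
<?php

// Array form: the longest matching key wins
$longest = strtr('Hi all', array('Hi' => 'Hello', 'Hi all' => 'Goodbye'));

// Array form: replaced text is not translated a second time
$noRescan = strtr('Hi', array('Hi' => 'Hello', 'Hello' => 'Bye'));

// Three-argument form: "a" maps to "e" and "i" maps to "o"
$byChars = strtr('Hi all', 'ai', 'eo');

echo $longest, "\n";  // Goodbye
echo $noRescan, "\n"; // Hello
echo $byChars, "\n";  // Ho ell
```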

Converting HTML to Plain Text

You may sometimes need to convert an HTML file to plain text. You can do so using the strip_tags() function , which removes all HTML and PHP tags from a string, leaving only the text entities. Its prototype follows:

string strip_tags(string str [, string allowable_tags])

The optional allowable_tags parameter allows you to specify which tags should be skipped during this process. Note that strip_tags() does not remove or inspect the attributes of the tags it skips, which can be dangerous if the input is provided by a user and those attributes contain JavaScript. This example uses strip_tags() to delete all HTML tags from a string:

<?php

$input = "Email <a href='[email protected]'>[email protected]</a>";

echo strip_tags($input);

This returns the following:

Email [email protected]

The following sample strips all tags except the <a> tag :

<?php

$input = "This <a href='http://www.example.com/'>example</a>

is awesome!";

echo strip_tags($input, "<a>");

This returns the following:

This <a href='http://www.example.com/'>example</a> is awesome!
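The warning about attributes in skipped tags can be illustrated with a short sketch. The input string is hypothetical:

```php
<?php

// Hypothetical user input: an allowed <a> tag carrying an event-handler attribute
$input = "<a href='#' onclick='alert(1)'>Click</a> <script>evil()</script>";

$result = strip_tags($input, "<a>");

// The <script> tags are gone, but the onclick attribute survives,
// and so does the text that was between the removed tags
echo $result; // <a href='#' onclick='alert(1)'>Click</a> evil()
```

For untrusted input, escaping with htmlspecialchars() is usually the safer choice than allowing tags through.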

Note

Another function that behaves like strip_tags() is fgetss(). This function is described in Chapter 10.

Alternatives for Regular Expression Functions

When you’re processing large amounts of information, the regular expression functions can slow matters dramatically. You should use these functions only when you are interested in parsing relatively complicated strings that require the use of regular expressions. If you are instead interested in parsing for simple expressions, there are a variety of predefined functions that speed up the process considerably. Each of these functions is described in this section.

Tokenizing a String Based on Predefined Characters

Tokenizing is a computer term for splitting a string into smaller parts. This is used by compilers to convert a program to individual commands or tokens. The strtok() function tokenizes the string based on a predefined list of characters. Its prototype follows:

string strtok(string str, string tokens)

One oddity about strtok() is that it must be continually called in order to completely tokenize a string; each call only tokenizes the next piece of the string. However, the str parameter needs to be specified only once because the function keeps track of its position in str until it either completely tokenizes str or a new str parameter is specified. Its behavior is best explained via an example:

<?php

$info = "J. Gilmore:[email protected]|Columbus, Ohio";

// delimiters include colon (:), vertical bar (|), and comma (,)

$tokens = ":|,";

$tokenized = strtok($info, $tokens);

// print each token as it is extracted

while ($tokenized !== false) {

    echo "Element = $tokenized ";

    // Don't include the first argument in subsequent calls.

    $tokenized = strtok($tokens);

}

This returns the following:

Element = J. Gilmore

Element = [email protected]

Element = Columbus

Element = Ohio
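If you would rather have all tokens in an array, the repeated calls can be wrapped in a small function. This is only a sketch; tokenize() is a hypothetical helper name, and the sample string is made up:

```php
<?php

// Hypothetical helper: gather every token that strtok() yields into an array
function tokenize(string $str, string $delimiters): array
{
    $parts = array();
    $token = strtok($str, $delimiters);

    // Compare against false so that a token such as "0" does not end the loop early
    while ($token !== false) {
        $parts[] = $token;
        $token = strtok($delimiters);
    }

    return $parts;
}

print_r(tokenize("Columbus:0|Akron, Ohio", ":|,"));
// Columbus, 0, Akron, and " Ohio" (the leading space is kept)
```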

Exploding a String Based on a Predefined Delimiter

The explode() function divides the string str into an array of substrings. Its prototype follows:

array explode(string separator, string str [, int limit])

The original string is divided into distinct elements wherever the string specified by separator occurs. The number of elements can be limited with the optional limit parameter. Let’s use explode() in conjunction with sizeof() (an alias of count()) and strip_tags() to determine the total number of words in a given block of text:

<?php

$summary = <<<summary

The most up to date source for PHP documentation is the PHP manual.

It contains many examples and user contributed code and comments.

It is available on the main PHP web site

<a href="http://www.php.net">PHP’s</a>.

summary;

$words = sizeof(explode(' ',strip_tags($summary)));

echo "Total words in summary: $words";

This returns the following:

Total words in summary: 30

Because this call splits only on the space character, words joined by nothing but a line break are counted as a single element.

The explode() function is generally considerably faster than preg_split(), so prefer it whenever a regular expression isn’t necessary.

Note

You might be wondering why the previous code is indented in an inconsistent manner. The multiple-line string was delimited using heredoc syntax, which, prior to PHP 7.3, requires the closing identifier to not be indented even a single space. See Chapter 3 for more information about heredoc.
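The optional limit parameter of explode() mentioned above behaves differently for positive and negative values. A brief sketch, using a made-up list of cities:

```php
<?php

$cities = "Columbus,Akron,Cleveland,Cincinnati";

// Positive limit: at most 2 elements, the last one holding the remainder
$firstAndRest = explode(",", $cities, 2);

// Negative limit: all elements except the last one
$allButLast = explode(",", $cities, -1);

print_r($firstAndRest); // Columbus and "Akron,Cleveland,Cincinnati"
print_r($allButLast);   // Columbus, Akron, Cleveland
```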

Converting an Array into a String

Just as you can use the explode() function to divide a delimited string into various array elements, you can concatenate array elements to form a single delimited string using the implode() function . Its prototype follows:

string implode(string delimiter, array pieces)

This example forms a string out of the elements of an array:

<?php

$cities = array("Columbus", "Akron", "Cleveland", "Cincinnati");

echo implode("|", $cities);

This returns the following:

Columbus|Akron|Cleveland|Cincinnati

Performing Complex String Parsing

The strpos() function finds the position of the first case-sensitive occurrence of a substring in a string. Its prototype follows:

int strpos(string str, string substr [, int offset])

The optional input parameter offset specifies the position at which to begin the search. If substr is not in str, strpos() will return FALSE. Because a match at the very start of the string yields position 0, compare the result against FALSE with the identity operator (===). The following example determines the timestamp of the first time index.html is accessed:

<?php

$substr = "index.html";

$log = <<< logfile

192.168.1.11:/www/htdocs/index.html:[2010/02/10:20:36:50]

192.168.1.13:/www/htdocs/about.html:[2010/02/11:04:15:23]

192.168.1.15:/www/htdocs/index.html:[2010/02/15:17:25]

logfile;

// Where is the first occurrence of $substr in the log?

$pos = strpos($log, $substr);

// Find the numerical position of the end of the line

$pos2 = strpos($log,"\n",$pos);

// Calculate the beginning of the timestamp

$pos = $pos + strlen($substr) + 1;

// Retrieve the timestamp

$timestamp = substr($log,$pos,$pos2-$pos);

echo "The file $substr was first accessed on: $timestamp";

This returns the time at which the file index.html was first accessed:

The file index.html was first accessed on: [2010/02/10:20:36:50]

The function stripos() operates identically to strpos(), except that it executes its search case insensitively.
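One detail of strpos() and stripos() deserves a short sketch: a match at position 0 is a valid result, yet it looks like FALSE in a loose comparison. The variable names here are illustrative only:

```php
<?php

$path = "index.html";

// "index" is found at position 0, which is falsy in a loose comparison
$pos = strpos($path, "index");

$looseCheck  = ($pos == false);  // true: wrongly looks like "not found"
$strictCheck = ($pos === false); // false: the match at 0 is recognized

// stripos() ignores case, so "HTML" matches at position 6
$ipos = stripos($path, "HTML");

var_dump($pos, $looseCheck, $strictCheck, $ipos);
```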

Finding the Last Occurrence of a String

The strrpos() function finds the last occurrence of a string, returning its numerical position. Its prototype follows:

int strrpos(string str, string substr [, int offset])

The optional parameter offset determines the position from which strrpos() will begin searching. Suppose you wanted to pare down lengthy news summaries, truncating the summary and replacing the truncated component with an ellipsis. However, rather than simply cut off the summary explicitly at the desired length, you want it to operate in a user-friendly fashion, truncating at the end of the word closest to the truncation length. This function is ideal for such a task. Consider this example:

<?php

// Limit $summary to how many characters?

$limit = 100;

$summary = <<< summary

The most up to date source for PHP documentation is the PHP manual.

It contains many examples and user contributed code and comments.

It is available on the main PHP web site

<a href="http://www.php.net">PHP’s</a>.

summary;

if (strlen($summary) > $limit) {

    // Cut at the last space that falls within the first $limit characters

    $summary = substr($summary, 0, strrpos(substr($summary, 0, $limit), ' ')) . '...';

}

echo $summary;

This returns the following:

The most up to date source for PHP documentation is the PHP manual.

It contains many examples and...
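The same truncation logic can be packaged as a reusable function. The following is a sketch under the assumption that a hard cut is acceptable when no space falls within the limit; truncateAtWord() is a hypothetical name:

```php
<?php

// Hypothetical helper wrapping the truncation logic shown above
function truncateAtWord(string $text, int $limit): string
{
    if (strlen($text) <= $limit) {
        return $text;
    }

    $head = substr($text, 0, $limit);
    $lastSpace = strrpos($head, ' ');

    // No space inside the limit: fall back to a hard cut
    if ($lastSpace === false) {
        return $head . '...';
    }

    return substr($head, 0, $lastSpace) . '...';
}

echo truncateAtWord("The quick brown fox jumps", 12), "\n"; // The quick...
echo truncateAtWord("Short", 12), "\n";                     // Short
```

Checking the strrpos() result against FALSE avoids silently returning an empty summary when the text contains no spaces.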

Replacing All Instances of a String with Another String

The str_replace() function case sensitively replaces all instances of a string with another. Its prototype follows:

mixed str_replace(mixed occurrence, mixed replacement, mixed str [, int &count])

If occurrence is not found in str, the original string is returned unmodified. If the optional parameter count is supplied, it is passed by reference and set to the number of replacements performed; it does not limit how many occurrences are replaced.

This function is ideal for hiding e-mail addresses from automated e-mail address retrieval programs:

<?php

$author = "jason@example.com";

$author = str_replace("@","(at)",$author);

echo "Contact the author of this article at $author.";

This returns the following:

Contact the author of this article at jason(at)example.com.

The function str_ireplace() operates identically to str_replace(), except that it is capable of executing a case-insensitive search.
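A few more aspects of str_replace() and str_ireplace() can be sketched briefly: the count argument, array arguments, and case-insensitive matching. The strings below are made up for illustration:

```php
<?php

// The fourth argument receives the number of replacements; it does not limit them
$result = str_replace("a", "o", "banana", $count);

// Arrays replace several strings in one call
$hidden = str_replace(array("@", "."), array("(at)", "(dot)"), "user@example.com");

// str_ireplace() ignores case when searching
$insensitive = str_ireplace("php", "PHP", "Php and pHp");

echo "$result ($count replacements)\n"; // bonono (3 replacements)
echo $hidden, "\n";                      // user(at)example(dot)com
echo $insensitive, "\n";                 // PHP and PHP
```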

Retrieving Part of a String

The strstr() function returns the remainder of a string beginning with the first occurrence of a predefined string. Its prototype follows:

string strstr(string str, string occurrence [, bool before_needle])

The optional before_needle parameter modifies the behavior of strstr() , causing the function to instead return the part of the string that is found before the first occurrence.

This example uses the function in conjunction with the ltrim() function to retrieve the domain name of an e-mail address:

<?php

$url = "[email protected]";

echo ltrim(strstr($url, "@"),"@");

This returns the following:

example.com
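The before_needle parameter makes it possible to retrieve the part in front of the @ sign as well. A minimal sketch, using a made-up address:

```php
<?php

// Made-up address for illustration
$email = "user@example.com";

$domainWithAt = strstr($email, "@");       // "@example.com"
$localPart    = strstr($email, "@", true); // "user"
$domain       = substr($domainWithAt, 1);  // "example.com"

echo "$localPart at $domain";
```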

Returning Part of a String Based on Predefined Offsets

The substr() function returns the part of a string located between a predefined starting offset and length positions. Its prototype follows:

string substr(string str, int start [, int length])

If the optional length parameter is not specified, the substring is considered to be the string starting at start and ending at the end of str. There are four points to keep in mind when using this function:

If start is positive, the returned string will begin at the start position of the string.
If start is negative, the returned string will begin that many characters from the end of the string (position length + start).
If length is provided and is positive, the returned string will consist of the characters between start and start + length. If this distance surpasses the total string length, only the string between start and the string’s end will be returned.
If length is provided and is negative, the returned string will end length characters from the end of str.

Keep in mind that start is the offset from the first character of str and strings (like arrays) are 0 indexed. Consider a basic example:

<?php

$car = "1944 Ford";

echo substr($car, 5);

This returns the following starting from the sixth character at position 5:

Ford

The following example uses the length parameter:

<?php

$car = "1944 Ford";

echo substr($car, 0, 4);

This returns the following:

1944
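The rules for a negative start and for a length that runs past the end of the string can be sketched in the same way, reusing the $car string from the examples above:

```php
<?php

$car = "1944 Ford";

$lastFour  = substr($car, -4);     // "Ford": start 4 characters from the end
$twoOfLast = substr($car, -4, 2);  // "Fo": 2 characters, starting 4 from the end
$tooLong   = substr($car, 5, 100); // "Ford": length runs past the end, so it stops there

echo "$lastFour $twoOfLast $tooLong";
```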

The final example uses a negative length parameter:

<?php

$car = "1944 Ford";

echo substr($car, 2, -5);

This returns the following:

44

Determining the Frequency of a String’s Appearance

The substr_count() function returns the number of times one string occurs within another. This function is case sensitive. Its prototype follows:

int substr_count(string str, string substring [, int offset [, int length]])

The optional offset and length parameters determine the string offset from which to begin attempting to match the substring within the string, and the maximum length of the string to search following the offset, respectively.

The following example determines the number of times an IT consultant uses various buzzwords in his presentation:

<?php

$buzzwords = array("mindshare", "synergy", "space");

$talk = <<< talk

I'm certain that we could dominate mindshare in this space with

our new product, establishing a true synergy between the marketing

and product development teams. We'll own this space in three months.

talk;

foreach($buzzwords as $bw) {

echo "The word $bw appears ".substr_count($talk,$bw)." time(s). ";

}

This returns the following:

The word mindshare appears 1 time(s).

The word synergy appears 1 time(s).

The word space appears 2 time(s).
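The optional offset and length parameters, along with two properties of the matching itself, can be sketched as follows. The sample text is made up:

```php
<?php

$text = "space, space and more space";

$all        = substr_count($text, "space");        // 3
$afterFirst = substr_count($text, "space", 5);     // 2: search starts at offset 5
$window     = substr_count($text, "space", 0, 12); // 2: only the first 12 characters
$overlap    = substr_count("aaa", "aa");           // 1: overlapping matches are not counted
$caseMiss   = substr_count($text, "Space");        // 0: the search is case sensitive

echo "$all $afterFirst $window $overlap $caseMiss";
```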

Replacing a Portion of a String with Another String

The substr_replace() function replaces a portion of a string with a replacement string, beginning the substitution at a specified starting position and ending at a predefined replacement length. Its prototype follows:

string substr_replace(string str, string replacement, int start [, int length])

If length is omitted, the replacement continues to the end of str. There are several behaviors you should keep in mind regarding the values of start and length:

If start is positive, replacement will begin at character start.
If start is negative, replacement will begin start characters from the end of str.
If length is provided and is positive, length characters of str will be replaced.
If length is provided and is negative, replacement will stop that many characters from the end of str.

Suppose you built an e-commerce site and within the user profile interface, you want to show just the last four digits of the provided credit card number. This function is ideal for such a task:

<?php

$ccnumber = "1234567899991111";

echo substr_replace($ccnumber,"************",0,12);

This returns the following:

************1111
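The negative start and length rules allow the same masking to be written without hard-coding the card length. The following is a sketch; maskNumber() is a hypothetical helper, not a built-in function:

```php
<?php

$ccnumber = "1234567899991111";

// Negative length: stop replacing 4 characters before the end
$masked = substr_replace($ccnumber, "************", 0, -4);

// Negative start with no length: replace from 4 characters before the end onward
$tailHidden = substr_replace($ccnumber, "XXXX", -4);

// Hypothetical helper: mask everything except the last $visible characters
function maskNumber(string $number, int $visible = 4): string
{
    $hiddenLength = max(strlen($number) - $visible, 0);

    // Replace exactly $hiddenLength characters, starting at position 0
    return substr_replace($number, str_repeat("*", $hiddenLength), 0, $hiddenLength);
}

echo $masked, "\n";               // ************1111
echo $tailHidden, "\n";           // 123456789999XXXX
echo maskNumber("12345678"), "\n"; // ****5678
```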

Padding and Stripping a String

For formatting reasons, you sometimes need to modify the string length via either padding or stripping characters. PHP provides a number of functions for doing so. This section examines many of the commonly used functions.

Trimming Characters from the Beginning of a String

The ltrim() function removes various characters from the beginning of a string, including whitespace, the horizontal tab (\t), newline (\n), carriage return (\r), NULL (\0), and vertical tab (\x0b). Its prototype follows:

string ltrim(string str [, string charlist])

You can designate other characters for removal by defining them in the optional parameter charlist .

Trimming Characters from the End of a String

The rtrim() function operates identically to ltrim(), except that it removes the designated characters from the right side of a string. Its prototype follows:

string rtrim(string str [, string charlist])

Trimming Characters from Both Sides of a String

You can think of the trim() function as a combination of ltrim() and rtrim(), except that it removes the designated characters from both sides of a string:

string trim(string str [, string charlist])
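The three trimming functions and the optional charlist parameter can be compared side by side. A short sketch with made-up input strings:

```php
<?php

$padded = "  \t PHP \n";

$left  = ltrim($padded); // "PHP \n"
$right = rtrim($padded); // "  \t PHP"
$both  = trim($padded);  // "PHP"

// charlist replaces the default set of characters to strip
$number = ltrim("000420", "0");      // "420"
$path   = rtrim("/var/www///", "/"); // "/var/www"
$tag    = trim("[[tag]]", "[]");     // "tag"

echo "$both $number $path $tag";
```

Note that only the leading or trailing characters are removed; the zero inside "420" is untouched.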

Padding a String

The str_pad() function pads a string with a specified number of characters. Its prototype follows:

string str_pad(string str, int length [, string pad_string [, int pad_type]])

If the optional parameter pad_string is not defined, str will be padded with blank spaces; otherwise, it will be padded with the character pattern specified by pad_string. By default, the string will be padded to the right; however, the optional parameter pad_type may be assigned the values STR_PAD_RIGHT (the default), STR_PAD_LEFT , or STR_PAD_BOTH , padding the string accordingly. This example shows how to pad a string using this function:

<?php

echo str_pad("Salad", 10)." is good.";

This returns the following:

Salad      is good.

This example makes use of str_pad()’s optional parameters:

<?php

$header = "Log Report";

echo str_pad ($header, 20, "=+", STR_PAD_BOTH);

This returns the following:

=+=+=Log Report=+=+=

Note that str_pad() truncates the pattern defined by pad_string if the length is reached before completing an entire repetition of the pattern.
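Left padding, pattern truncation, and the case where the string is already longer than the requested length can be sketched as follows. The values are made up for illustration:

```php
<?php

$invoice   = str_pad("42", 6, "0", STR_PAD_LEFT); // "000042"
$truncated = str_pad("Log", 6, "=+");             // "Log=+=": the pattern is cut off mid-repetition
$unchanged = str_pad("Salad", 3);                 // "Salad": the string is never shortened

echo "$invoice $truncated $unchanged";
```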

Counting Characters and Words

It’s often useful to determine the total number of characters or words in a given string. Although PHP’s considerable capabilities in string parsing have long made this task trivial, the following two functions were added to formalize the process.

Counting the Number of Characters in a String

The function count_chars() offers information regarding the characters found in a string. This function only works on single byte characters. Its prototype follows:

mixed count_chars(string str [, int mode])

Its behavior depends on how the optional parameter mode is defined:

0: Returns an array consisting of each found byte value (0-255 representing each possible character) as the key and the corresponding frequency as the value, even if the frequency is zero. This is the default.
1: Same as 0, but returns only those byte values with a frequency greater than zero.
2: Same as 0, but returns only those byte values with a frequency of zero.
3: Returns a string containing all located byte values.
4: Returns a string containing all unused byte values.

The following example counts the frequency of each character in $sentence:

<?php

$sentence = "The rain in Spain falls mainly on the plain";

// Retrieve located characters and their corresponding frequency.

$chart = count_chars($sentence, 1);

foreach($chart as $letter=>$frequency) {

    echo "Character ".chr($letter)." appears $frequency times ";

}

This returns the following:

Character appears 8 times

Character S appears 1 times

Character T appears 1 times

Character a appears 5 times

Character e appears 2 times

Character f appears 1 times

Character h appears 2 times

Character i appears 5 times

Character l appears 4 times

Character m appears 1 times

Character n appears 6 times

Character o appears 1 times

Character p appears 2 times

Character r appears 1 times

Character s appears 1 times

Character t appears 1 times

Character y appears 1 times
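Mode 3 is handy when you only need the set of distinct characters. The sketch below contrasts it with mode 1 on a short, made-up word:

```php
<?php

$word = "hello";

// Mode 3: a string of each byte used, in ascending byte order
$unique = count_chars($word, 3);

// Mode 1: byte value => frequency, converted here to character keys
$counts = array();
foreach (count_chars($word, 1) as $byte => $frequency) {
    $counts[chr($byte)] = $frequency;
}

echo $unique, "\n"; // ehlo
print_r($counts);   // e => 1, h => 1, l => 2, o => 1
```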

Counting the Total Number of Words in a String

The function str_word_count() offers information regarding the total number of words found in a string. Words are defined as strings of alphabetic characters, which depend on the locale setting, and may contain, but not start with, an apostrophe (') or hyphen (-). Its prototype follows:

mixed str_word_count(string str [, int format])

If the optional parameter format is not defined, it will return the total number of words. If format is defined, it modifies the function’s behavior based on its value:

1: Returns an array consisting of all words located in str.
2: Returns an associative array where the key is the numerical position of the word in str and the value is the word itself.

Consider an example:

<?php

$summary = <<< summary

The most up to date source for PHP documentation is the PHP manual.

It contains many examples and user contributed code and comments.

It is available on the main PHP web site

<a href="http://www.php.net">PHP’s</a>.

summary;

$words = str_word_count($summary);

printf("Total words in summary: %s", $words);

This returns the following:

Total words in summary: 41

You can use this function in conjunction with array_count_values() to determine the frequency with which each word appears within the string:

<?php

$summary = <<< summary

The most up to date source for PHP documentation is the PHP manual.

It contains many examples and user contributed code and comments.

It is available on the main PHP web site

<a href="http://www.php.net">PHP’s</a>.

summary;

$words = str_word_count($summary,2);

$frequency = array_count_values($words);

print_r($frequency);

This returns the following:

Array ( [The] => 1 [most] => 1 [up] => 1 [to] => 1 [date] => 1 [source] => 1 [for] => 1 [PHP] => 4 [documentation] => 1 [is] => 2 [the] => 2 [manual] => 1 [It] => 2 [contains] => 1 [many] => 1 [examples] => 1 [and] => 2 [user] => 1 [contributed] => 1 [code] => 1 [comments] => 1 [available] => 1 [on] => 1 [main] => 1 [web] => 1 [site] => 1 [a] => 2 [href] => 1 [http] => 1 [www] => 1 [php] => 1 [net] => 1 [s] => 1 )
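The difference between the three return formats is easiest to see on a short string. A minimal sketch, using a made-up sentence:

```php
<?php

$text = "The cat's hat";

$total     = str_word_count($text);    // 3
$words     = str_word_count($text, 1); // list of the words
$positions = str_word_count($text, 2); // word keyed by its starting position

print_r($words);     // The, cat's, hat
print_r($positions); // 0 => The, 4 => cat's, 10 => hat
```

The apostrophe inside cat's is treated as part of the word, in line with the definition given earlier.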

Summary

Many of the functions introduced in this chapter will be among the most commonly used within your PHP applications, as they form the crux of the language’s string-manipulation capabilities.

The next chapter examines another set of commonly used functions: those devoted to working with the file and operating system.
