Programmers build applications based on established rules regarding the classification, parsing, storage, and display of information, whether that information consists of gourmet recipes, store sales receipts, poetry, or anything else. This chapter introduces many of the PHP functions that you’ll undoubtedly use on a regular basis when performing such tasks.
Regular expressions: PHP supports the use of regular expressions to search strings for patterns or replace elements of a string with another value based on patterns. There are several types of regular expressions, and the one supported in PHP is called Perl-style regex, or PCRE (Perl Compatible Regular Expressions).
String manipulation: PHP is the “Swiss Army Knife” of string manipulation, allowing you to slice and dice text in nearly every conceivable fashion. With nearly 100 native string manipulation functions, and the ability to chain functions together to produce even more sophisticated behaviors, PHP offers so much that you’ll run out of programming ideas before exhausting its capabilities in this regard. In this chapter, I’ll introduce you to several of the most commonly used manipulation functions that PHP has to offer.
Regular Expressions
Regular expressions provide the foundation for describing or matching data according to defined syntax rules. A regular expression is nothing more than a pattern of characters itself, matched against a certain parcel of text. This sequence may be a pattern with which you are already familiar, such as the word dog, or it may be a pattern with a specific meaning in the context of the world of pattern matching, <(?)>.*< /.?>, for example.
If you are not already familiar with the mechanics of regular expressions, please take some time to read through the short tutorial that makes up the remainder of this section. However, because innumerable online and print tutorials have been written regarding this matter, I’ll focus on providing you with just a basic introduction to the topic. If you are already well-acquainted with regular expression syntax, feel free to skip past the tutorial to the “PHP’s Regular Expression Functions (Perl Compatible)” section.
Regular Expression Syntax (Perl)
Perl has long been considered one of the most powerful parsing languages ever written. It provides a comprehensive regular expression language that can be used to search, modify, and replace even the most complicated of string patterns. The developers of PHP felt that instead of reinventing the regular expression wheel, so to speak, they should make the famed Perl regular expression syntax available to PHP users.
The pattern /fo{2,4}/ matches f followed by two to four occurrences of o. Some potential matches include fool, fooool, and foosball.
The three patterns /fo+/, /fo*/, and /fo{2,4}/ each start with an f, followed by one or more o’s, zero or more o’s, or between two and four o’s, respectively. Any character before or after the pattern is not part of the match.
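The three quantifier forms just described can be tried side by side. The following is a minimal sketch, assuming the patterns /fo+/, /fo*/, and /fo{2,4}/ and a hypothetical list of sample words; it is an illustration rather than a listing from the original text.

```php
<?php
// Hypothetical sample words used only for illustration.
$words = ['f', 'fo', 'fool', 'fooool', 'foosball'];

$patterns = [
    'one or more o'  => '/fo+/',
    'zero or more o' => '/fo*/',
    'two to four o'  => '/fo{2,4}/',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // preg_grep() keeps only the words in which the pattern is found.
    $results[$label] = array_values(preg_grep($pattern, $words));
}

print_r($results);
```

Note that only /fo*/ accepts the lone f, because it is the only pattern that allows zero occurrences of o.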
Modifiers
Five Sample Modifiers
Modifier | Description |
---|---|
i | Perform a case-insensitive search. |
m | Treat a string as several (m for multiple) lines. By default, the ^ and $ characters match at the very start and very end of the string in question. Using the m modifier will allow for ^ and $ to match at the beginning of any line in a string. |
s | Treat a string as a single line, so that the . metacharacter also matches any newline characters found within. |
x | Ignore whitespace and comments within the regular expression, unless the whitespace is escaped or within a character block. |
U | Invert the greediness of quantifiers. Many quantifiers are “greedy”; they match as much of the string as possible rather than stopping at the first possible match. You can cause them to be “ungreedy” with this modifier. |
These modifiers are placed directly after the regular expression—for instance,
/wmd/i: Matches WMD, wMD, WMd, wmd, and any other case variation of the string wmd.
Other languages support a global modifier (g). In PHP, however, this behavior is selected by choosing between two functions: preg_match(), which stops after the first match, and preg_match_all(), which finds every match.
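To make the modifier placement and the missing g modifier concrete, here is a small sketch. The sample string is hypothetical; the pattern /wmd/i is the one used in the text.

```php
<?php
// Hypothetical sample text containing three case variations of "wmd".
$text = 'WMD, wmd, and Wmd';

// preg_match() stops after the first match and returns 1 or 0.
$first = preg_match('/wmd/i', $text, $one);

// preg_match_all() continues through the string and returns the match count.
$total = preg_match_all('/wmd/i', $text, $all);

echo "First match: {$one[0]}", PHP_EOL;
echo "Total matches: {$total}", PHP_EOL;
```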
Metacharacters
\A : Matches only at the beginning of the string.
\b : Matches a word boundary.
\B : Matches anything but a word boundary.
\d : Matches a digit character. This is the same as [0-9].
\D : Matches a nondigit character.
\s : Matches a whitespace character.
\S : Matches a nonwhitespace character.
[] : Encloses a character class.
() : Encloses a character grouping or defines a back reference or the start and end of a subpattern.
$ : Matches the end of the string, or the end of every line in multiline mode.
^ : Matches the beginning of the string or beginning of every line in multiline mode.
. : Matches any character except for the newline.
\ : Quotes the next metacharacter.
\w : Matches any single character that is an underscore or alphanumeric (a “word” character). This depends on the locale. For U.S. English this is the same as [a-zA-Z0-9_].
\W : Matches any single character that is not an underscore or alphanumeric character.
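A few of these metacharacters are easier to grasp when seen in action. The sketch below uses hypothetical sample strings and is only meant as an illustration of \d, \w, \b, and \B.

```php
<?php
// \d+ collects runs of digits.
preg_match_all('/\d+/', 'Order 66 shipped 2 items', $digits);

// \w+ collects runs of word characters; the underscore counts, the hyphen does not.
preg_match_all('/\w+/', 'snake_case, kebab-case', $words);

// \b requires a word boundary, so "cat" inside "concatenate" is rejected.
$asWord = preg_match('/\bcat\b/', 'concatenate');

// \B requires the absence of a boundary, so the embedded "cat" is accepted.
$embedded = preg_match('/\Bcat\B/', 'concatenate');

print_r($digits[0]);
print_r($words[0]);
```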
PHP’s Regular Expression Functions (Perl Compatible)
PHP offers nine functions for searching and modifying strings using Perl-compatible regular expressions: preg_filter(), preg_grep(), preg_match(), preg_match_all(), preg_quote(), preg_replace(), preg_replace_callback(), preg_replace_callback_array(), and preg_split(). In addition to these, the preg_last_error() function provides a way to get the error code for the last execution. These functions are introduced in the following sections.
Searching for a Pattern
For instance, this script will confirm a match if the word Vim or vim is located, but not simplevim, vims, or evim.
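A script with the behavior just described might look like the following sketch. The helper name mentionsVim() is hypothetical; the pattern relies on \b for word boundaries and on the i modifier for case insensitivity.

```php
<?php
// Hypothetical helper: true only when "vim" appears as a standalone word.
function mentionsVim(string $line): bool
{
    // preg_match() returns 1 on a match, 0 on no match, and false on error.
    return preg_match('/\bvim\b/i', $line) === 1;
}

foreach (['Vim is a great editor', 'simplevim', 'vims', 'evim'] as $line) {
    echo $line, ': ', mentionsVim($line) ? 'match' : 'no match', PHP_EOL;
}
```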
You can use the optional flags parameter to modify the behavior of the returned matches parameter, changing how the array is populated by instead returning every matched string and its corresponding offset as determined by the location of the match.
Finally, the optional offset parameter will adjust the search starting point within the string to a specified position.
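The effect of the flags and offset parameters can be seen in this brief sketch, which assumes a hypothetical subject string and uses the PREG_OFFSET_CAPTURE flag.

```php
<?php
// Hypothetical subject string containing "vim" twice.
$subject = 'vim, emacs, vim';

// With PREG_OFFSET_CAPTURE each match becomes a [string, offset] pair.
preg_match('/vim/', $subject, $matches, PREG_OFFSET_CAPTURE);

// Starting the search at offset 1 skips the first occurrence.
preg_match('/vim/', $subject, $later, PREG_OFFSET_CAPTURE, 1);

echo "First search found '{$matches[0][0]}' at {$matches[0][1]}", PHP_EOL;
echo "Second search found '{$later[0][0]}' at {$later[0][1]}", PHP_EOL;
```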
Matching All Occurrences of a Pattern
PREG_PATTERN_ORDER is the default if the optional flags parameter is not defined. PREG_PATTERN_ORDER specifies the order in the way that you might think most logical: $pattern_array[0] is an array of all complete pattern matches, $pattern_array[1] is an array of all strings matching the first parenthesized regular expression, and so on.
PREG_SET_ORDER orders the array a bit differently than the default setting. $pattern_array[0] contains the first set of matches (the complete match followed by the strings matching each parenthesized regular expression), $pattern_array[1] contains the second set of matches, and so on.
PREG_OFFSET_CAPTURE modifies the behavior of the returned matches parameter, changing how the array is populated by instead returning every matched string and its corresponding offset as determined by the location of the match.
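The difference between the ordering flags is easiest to see by running the same pattern with each of them. This is a sketch with a hypothetical HTML fragment and pattern, not a listing from the original text.

```php
<?php
// Hypothetical markup with two bold names.
$html    = '<b>Zeev</b> and <b>Andi</b>';
$pattern = '/<b>(\w+)<\/b>/';

// Default ordering: [0] holds all full matches, [1] all first-group matches.
preg_match_all($pattern, $html, $byPattern, PREG_PATTERN_ORDER);

// Set ordering: [0] holds the first match set, [1] the second match set.
preg_match_all($pattern, $html, $bySet, PREG_SET_ORDER);

// Offset capture: every entry becomes a [string, offset] pair.
preg_match_all($pattern, $html, $withOffsets, PREG_OFFSET_CAPTURE);

print_r($byPattern);
print_r($bySet);
```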
Searching an Array
array preg_grep(string pattern, array input [, int flags])
Note that the returned array preserves the keys of the input array. If the value at a given index position matches, it’s included under the same key in the output array. Otherwise, that key is omitted, which can leave gaps in the numbering. If you want consecutive keys again, pass the output array through the function array_values(), introduced in Chapter 5.
The optional input parameter flags accepts one value, PREG_GREP_INVERT. Passing this flag will result in retrieval of those array elements that do not match the pattern.
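The key handling and the PREG_GREP_INVERT flag can be sketched as follows, assuming a hypothetical list of foods and a pattern that matches words beginning with p.

```php
<?php
// Hypothetical input array.
$foods = ['pasta', 'steak', 'fish', 'potatoes'];

// Matching elements keep their original keys (0 and 3 here).
$startsWithP = preg_grep('/^p/', $foods);

// array_values() renumbers the result from zero.
$reindexed = array_values($startsWithP);

// PREG_GREP_INVERT returns the elements that do not match.
$others = preg_grep('/^p/', $foods, PREG_GREP_INVERT);

print_r($startsWithP);
print_r($others);
```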
Delimiting Special Regular Expression Characters
Replacing All Occurrences of a Pattern
The preg_filter() function operates in a fashion identical to preg_replace(), except that, rather than returning every subject, it returns only the (modified) subjects in which a match was found.
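A side-by-side comparison shows how the two functions differ when given an array of subjects. The sample data below is hypothetical.

```php
<?php
// Hypothetical subjects; only the second one contains "an".
$subjects = ['apple pie', 'banana split', 'cherry tart'];

// preg_replace() returns every subject, changed or not.
$replaced = preg_replace('/an/', 'AN', $subjects);

// preg_filter() returns only the subjects where a replacement happened,
// keeping their original keys.
$filtered = preg_filter('/an/', 'AN', $subjects);

print_r($replaced);
print_r($filtered);
```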
Creating a Custom Replacement Function
PHP 7.0 introduced a variant of preg_replace_callback() called preg_replace_callback_array(). These functions work in similar ways, except the new function combines pattern and callback into an array of patterns and callbacks. This makes it possible to do multiple substitutions with a single function call.
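As a rough illustration of multiple substitutions in one call, consider the sketch below. The input string and both callbacks are hypothetical; the patterns are applied in the order they appear in the array.

```php
<?php
// Hypothetical input text.
$text = 'php 7 and mysql 8';

$result = preg_replace_callback_array(
    [
        // First pass: uppercase every run of lowercase letters.
        '/\b[a-z]+\b/' => function (array $m): string {
            return strtoupper($m[0]);
        },
        // Second pass: wrap every number in square brackets.
        '/\d+/' => function (array $m): string {
            return '[' . $m[0] . ']';
        },
    ],
    $text
);

echo $result, PHP_EOL; // PHP [7] AND MYSQL [8]
```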
Splitting a String into Various Elements Based on a Case-Insensitive Pattern
Note
Later in this chapter, the “Alternatives for Regular Expression Functions” section offers several standard functions that can be used in lieu of regular expressions for certain tasks. In many cases, these alternative functions actually perform much faster than their regular expression counterparts.
Other String-Specific Functions
Determining string length
Comparing two strings
Manipulating string case
Converting strings to and from HTML
Alternatives for regular expression functions
Padding and stripping a string
Counting characters and words
Note
The functions described in this section assume that the strings are composed of single-byte characters. That means the number of characters in a string is equal to the number of bytes. Some character sets use multiple bytes to represent each character. The standard PHP functions will often fail to provide the correct values when used on multibyte strings. There is an extension available called mbstring that can be used to manipulate multibyte strings.
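The byte-versus-character distinction can be demonstrated with a short sketch. It assumes a UTF-8 string and guards the mb_strlen() call, because the mbstring extension may not be installed.

```php
<?php
// "naïve" written with a Unicode escape (PHP 7+), so the string is UTF-8.
$word = "na\u{00EF}ve";

// strlen() counts bytes; the ï occupies two bytes in UTF-8.
$bytes = strlen($word);

// mb_strlen() counts characters, but only exists when mbstring is loaded.
$chars = function_exists('mb_strlen') ? mb_strlen($word, 'UTF-8') : null;

echo "Bytes: {$bytes}", PHP_EOL;
echo 'Characters: ', $chars ?? 'mbstring not available', PHP_EOL;
```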
Determining the Length of a String
In this case, the error message will not appear because the chosen password consists of 10 characters, whereas the conditional expression checks whether the target string consists of fewer than 10 characters.
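A check of this kind could be sketched as follows. The password value and the messages are hypothetical; only the use of strlen() and the 10-character threshold come from the text.

```php
<?php
// Hypothetical 10-character password.
$password = 'secretpswd';

// strlen() returns the number of bytes in the string.
$tooShort = strlen($password) < 10;

if ($tooShort) {
    echo 'Password is too short!', PHP_EOL;
} else {
    echo 'Password is valid!', PHP_EOL;
}
```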
Comparing Two Strings
String comparison is arguably one of the most important features of the string-handling capabilities of any language. Although there are many ways in which two strings can be compared for equality, PHP provides four functions for performing this task: strcmp(), strcasecmp(), strspn(), and strcspn().
Comparing Two Strings Case Sensitively
0 if str1 and str2 are equal
A negative value (-1 as of PHP 8.2) if str1 is less than str2
A positive value (1 as of PHP 8.2) if str2 is less than str1
Note that the strings must match exactly for strcmp() to consider them equal. For example, Supersecret is different from supersecret. If you’re looking to compare two strings case insensitively, consider strcasecmp(), introduced next.
While both accomplish the same goal, which is to compare two strings, keep in mind that the values they return in doing so are different.
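The return values of strcmp() can be checked with a short sketch. The variable names and values are hypothetical; the tests compare against zero rather than exact values, because the precise nonzero result has varied between PHP versions.

```php
<?php
// Hypothetical passwords differing only in the case of the first letter.
$pswd  = 'supersecret';
$pswd2 = 'Supersecret';

// Identical strings compare as 0.
$same = strcmp($pswd, $pswd);

// 's' has a higher byte value than 'S', so the result is positive.
$diff = strcmp($pswd, $pswd2);

if ($diff !== 0) {
    echo 'Passwords do not match!', PHP_EOL;
}
```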
Comparing Two Strings Case Insensitively
In this example, the message is output because strcasecmp() performs a case-insensitive comparison of $email1 and $email2 and determines that they are indeed identical.
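An example along these lines might look like the sketch below. The e-mail addresses are hypothetical placeholders; the variable names follow the text.

```php
<?php
// Hypothetical addresses that differ only in letter case.
$email1 = 'admin@example.com';
$email2 = 'ADMIN@example.com';

// strcasecmp() returns 0 when the strings are equal ignoring case.
$identical = strcasecmp($email1, $email2) === 0;

if ($identical) {
    echo 'The email addresses are identical!', PHP_EOL;
}
```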
Calculating the Similarity Between Two Strings
In this case, the error message is returned because $password does indeed consist solely of digits.
You can use the optional start parameter to define a starting position within the string other than the default 0 offset. The optional length parameter can be used to define the length of str1 string that will be used in the comparison.
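The following sketch shows how strspn() might be used for the digits-only check, along with the start and length parameters. The password value and the second sample string are hypothetical.

```php
<?php
// Hypothetical password consisting only of digits.
$password = '3312345';
$digits   = '1234567890';

// strspn() returns the length of the leading run made up of the given characters.
$allDigits = strspn($password, $digits) === strlen($password);

if ($allDigits) {
    echo 'The password cannot consist solely of numbers!', PHP_EOL;
}

// Begin measuring at offset 3, so the letters "abc" are skipped.
$fromOffset = strspn('abc123', $digits, 3);

// Additionally limit the examined portion to two characters.
$limited = strspn('abc123', $digits, 3, 2);
```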
Calculating the Difference Between Two Strings
In this case, the error message will not be displayed because $password does not consist solely of numbers.
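For comparison, here is a sketch of strcspn(), which measures the leading run of characters that are not in the given set. The sample values are hypothetical.

```php
<?php
$digits = '1234567890';

// The first character is a letter, so one character precedes the first digit.
$leading = strcspn('a12345', $digits);

// A string that starts with a digit yields 0.
$none = strcspn('12345', $digits);

// A string without any digits yields its full length.
$noDigits = strcspn('hello', $digits) === strlen('hello');

echo "Characters before the first digit: {$leading}", PHP_EOL;
```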
Manipulating String Case
Five functions are available to aid you in manipulating the case of characters in a string: strtolower(), strtoupper(), ucfirst(), lcfirst(), and ucwords().
Converting a String to All Lowercase
Converting a String to All Uppercase
Capitalizing the First Letter of a String
Note that while the first letter is indeed capitalized, the capitalized word PHP was left untouched. The function lcfirst() performs the opposite action of turning the first character of a string to lowercase.
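Both functions can be seen in this brief sketch, which uses hypothetical sample sentences.

```php
<?php
// Hypothetical sentence with an already capitalized acronym.
$sentence = 'the newest version of PHP was released today.';

// Only the very first character is changed; "PHP" is left as is.
$capitalized = ucfirst($sentence);

// lcfirst() lowercases only the first character.
$lowered = lcfirst('Hello World');

echo $capitalized, PHP_EOL;
echo $lowered, PHP_EOL;
```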
Capitalizing Each Word in a String
Note that if O’Malley was accidentally written as O’malley, ucwords() would not catch the error, as it considers a word to be defined as a string of characters separated from other entities in the string by a blank space on each side.
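The whitespace-only behavior can be observed in the sketch below. It also uses the optional delimiters parameter of ucwords(), which goes beyond what is described above and is shown only as one possible way to handle names such as O’Malley. The sample string is hypothetical.

```php
<?php
// Hypothetical title written entirely in lowercase.
$title = "o'malley wins big";

// By default only characters following whitespace are capitalized.
$default = ucwords($title);

// Treating the apostrophe as an additional delimiter capitalizes the M as well.
$withDelimiters = ucwords($title, " '");

echo $default, PHP_EOL;        // O'malley Wins Big
echo $withDelimiters, PHP_EOL; // O'Malley Wins Big
```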
Converting Strings to and from HTML
Converting a string or an entire file into a form suitable for viewing on the Web (and vice versa) is easier than you would think, but it comes with some security risks. If the input string is provided by a user who is browsing the website, it may be possible to inject script code that the browser will execute because it appears to come from the server. Do not trust input from users. The following functions are suited for such tasks.
Converting Newline Characters to HTML Break Tags
Converting Special Characters to Their HTML Equivalents
During the general course of communication, you may come across many characters that are not included in a document’s text encoding, or that are not readily available on the keyboard. Examples of such characters include the copyright symbol (©), the cent sign (¢), and the letter e with a grave accent (è). To address such shortcomings, a set of universal key codes was devised, known as character entity references. When these entities are parsed by the browser, they will be converted into their recognizable counterparts. For example, the three aforementioned characters would be written as &copy;, &cent;, and &egrave;, respectively.
ENT_COMPAT : Convert double quotes and ignore single quotes. This was the default before PHP 8.1; since then the default is ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401.
ENT_NOQUOTES : Ignore both double and single quotes.
ENT_QUOTES : Convert both double and single quotes.
htmlentities()’s Supported Character Sets
Character Set | Description |
---|---|
BIG5 | Traditional Chinese |
BIG5-HKSCS | BIG5 with additional Hong Kong extensions, traditional Chinese |
cp866 | DOS-specific Cyrillic character set |
cp1251 | Windows-specific Cyrillic character set |
cp1252 | Windows-specific character set for Western Europe |
EUC-JP | Japanese |
GB2312 | Simplified Chinese |
ISO-8859-1 | Western European, Latin-1 |
ISO-8859-5 | Little-used Cyrillic charset (Latin/Cyrillic). |
ISO-8859-15 | Western European, Latin-9 |
KOI8-R | Russian |
Shift_JIS | Japanese |
MacRoman | Charset that was used by Mac OS |
UTF-8 | ASCII-compatible multibyte 8-bit Unicode encoding |
The final optional parameter, double_encode, controls whether htmlentities() re-encodes HTML entities that already exist in the string. It defaults to true, meaning that existing entities are encoded again; set it to false to leave them untouched, which is probably what you want if you suspect HTML entities already exist in the target string.
Two characters are converted, the e with a grave accent (è) and the c with a cedilla (ç). The single quotes are ignored under the quote_style setting ENT_COMPAT, which was the default before PHP 8.1.
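The behavior described here can be reproduced with a sketch like the following. The advertisement string is a hypothetical stand-in, and the flags and character set are passed explicitly so that the result does not depend on the PHP version's defaults.

```php
<?php
// Hypothetical text containing è, ç, and single quotes (UTF-8 via escapes).
$advertisement = "Coffee at 'Caf\u{00E8} Fran\u{00E7}aise' costs \$2.25.";

// ENT_COMPAT leaves the single quotes alone.
$compat = htmlentities($advertisement, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES also converts the single quotes.
$quotes = htmlentities($advertisement, ENT_QUOTES, 'UTF-8');

// double_encode = false leaves an existing entity untouched.
$once = htmlentities('R&amp;D', ENT_COMPAT, 'UTF-8', false);

// The default (true) encodes the ampersand of the entity again.
$twice = htmlentities('R&amp;D', ENT_COMPAT, 'UTF-8');

echo $compat, PHP_EOL;
```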
Using Special HTML Characters for Other Purposes
The optional charset and double_encode parameters operate in a fashion identical to the explanation provided in the previous section on the htmlentities() function.
& becomes &amp;
" (double quote) becomes &quot;
' (single quote) becomes &#039;
< becomes &lt;
> becomes &gt;
This function is particularly useful in preventing users from entering HTML markup into an interactive web application, such as a message board.
If the translation isn’t necessary, perhaps a more efficient way to do this would be to use strip_tags(), which deletes the tags from the string altogether.
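The two approaches can be compared in a short sketch. The input string is a hypothetical example of user-supplied text.

```php
<?php
// Hypothetical message-board input containing markup.
$input = 'Tom & <b>Jerry</b>';

// htmlspecialchars() keeps the markup visible by escaping it.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// strip_tags() removes the tags and keeps the text between them.
$stripped = strip_tags($input);

echo $escaped, PHP_EOL;
echo $stripped, PHP_EOL;
```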
Tip
If you are using htmlspecialchars() in conjunction with a function such as nl2br(), you should execute nl2br() after htmlspecialchars(); otherwise, the <br /> tags that are generated with nl2br() will be converted to visible characters.
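The effect of the call order is shown in this sketch, which uses a hypothetical two-line comment.

```php
<?php
// Hypothetical two-line comment containing special characters.
$comment = "1 < 2\nand 3 > 2";

// Escape first, then add the break tags: the <br /> survives as markup.
$right = nl2br(htmlspecialchars($comment, ENT_QUOTES, 'UTF-8'));

// Add break tags first, then escape: the <br /> itself gets escaped.
$wrong = htmlspecialchars(nl2br($comment), ENT_QUOTES, 'UTF-8');

echo $right, PHP_EOL;
echo $wrong, PHP_EOL;
```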
Converting Text into Its HTML Equivalent
This returned value can then be used in conjunction with another predefined function, strtr() (formally introduced later in this section), to essentially translate the text into its corresponding HTML code.
Interestingly, array_flip() is capable of reversing the text-to-HTML translation and vice versa. Assume that instead of printing the result of strtr() in the preceding code sample, you assign it to the variable $translated_string.
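The round trip described above might be sketched as follows. The Italian sample sentence is hypothetical, and the flags and character set are passed explicitly to get_html_translation_table() so that the table does not depend on version-specific defaults.

```php
<?php
// Translation table mapping characters to their HTML entities.
$table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// Hypothetical sentence containing è and ù (UTF-8 via escapes).
$string = "La pasta \u{00E8} il piatto pi\u{00F9} amato in Italia";

// strtr() replaces each character found in the table with its entity.
$translated_string = strtr($string, $table);

// Flipping the table turns entities back into characters.
$original = strtr($translated_string, array_flip($table));

echo $translated_string, PHP_EOL;
```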
Creating a Customized Conversion List
Converting HTML to Plain Text
Note
Another function that behaves like strip_tags() is fgetss(). This function is described in Chapter 10. Be aware that fgetss() was deprecated in PHP 7.3 and removed in PHP 8.0.
Alternatives for Regular Expression Functions
When you’re processing large amounts of information, the regular expression functions can slow matters dramatically. You should use these functions only when you are interested in parsing relatively complicated strings that require the use of regular expressions. If you are instead interested in parsing for simple expressions, there are a variety of predefined functions that speed up the process considerably. Each of these functions is described in this section.
Tokenizing a String Based on Predefined Characters
Exploding a String Based on a Predefined Delimiter
The explode() function is generally considerably faster than preg_split(). Therefore, prefer it whenever a regular expression isn’t necessary.
Note
You might be wondering why the previous code is indented in an inconsistent manner. The multiple-line string was delimited using heredoc syntax, which before PHP 7.3 required the closing identifier to not be indented even a single space. See Chapter 3 for more information about heredoc.
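To illustrate the earlier point about preferring explode() over preg_split() for fixed delimiters, here is a small sketch. The pipe-delimited string is a hypothetical example.

```php
<?php
// Hypothetical pipe-delimited summary line.
$summary = 'PHP|MySQL|Apache';

// explode() splits on a literal delimiter.
$parts = explode('|', $summary);

// The optional limit caps the number of elements; the rest stays in the last one.
$limited = explode('|', $summary, 2);

// preg_split() gives the same result here, but needs the pipe escaped.
$viaRegex = preg_split('/\|/', $summary);

print_r($parts);
```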
Converting an Array into a String
Performing Complex String Parsing
The function stripos() operates identically to strpos(), except that it executes its search case insensitively.
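The difference between the two functions can be sketched as follows, using a hypothetical sentence. Because a match at the very start of the string returns 0, the result should be compared with false using the strict operator.

```php
<?php
// Hypothetical sentence to search.
$haystack = 'PHP makes string handling easy';

// strpos() is case sensitive and returns the offset of the first match.
$pos = strpos($haystack, 'string');

// The uppercase needle is not found, so false is returned.
$missing = strpos($haystack, 'STRING');

// stripos() ignores case and finds the same offset.
$insensitive = stripos($haystack, 'STRING');

// A match at offset 0 is not the same as "not found".
$atStart = strpos($haystack, 'PHP');
if ($atStart !== false) {
    echo "Found at offset {$atStart}", PHP_EOL;
}
```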
Finding the Last Occurrence of a String
Replacing All Instances of a String with Another String
If occurrence is not found in str, the original string is returned unmodified. If the optional parameter count is supplied, it is passed by reference and set to the number of replacements performed.
The function str_ireplace() operates identically to str_replace(), except that it is capable of executing a case-insensitive search.
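Both functions, together with the count parameter, appear in the sketch below. The sample strings are hypothetical.

```php
<?php
// Hypothetical e-mail address to obfuscate.
$author = 'jason@example.com';

// $count receives the number of replacements that were made.
$obfuscated = str_replace('@', '(at)', $author, $count);

// A search string that does not occur leaves the subject unchanged.
$unchanged = str_replace('#', '-', $author);

// str_ireplace() matches regardless of case.
$renamed = str_ireplace('PHP', 'Python', 'I like php and PHP', $n);

echo $obfuscated, PHP_EOL;
echo $renamed, PHP_EOL;
```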
Retrieving Part of a String
The optional before_needle parameter modifies the behavior of strstr(), causing the function to instead return the part of the string that is found before the first occurrence.
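A short sketch shows both modes of strstr(), using a hypothetical e-mail address.

```php
<?php
// Hypothetical e-mail address.
$email = 'jason@example.com';

// By default the result starts at the first occurrence of the needle.
$domainPart = strstr($email, '@');

// With before_needle set to true, the part before the needle is returned.
$userPart = strstr($email, '@', true);

echo $domainPart, PHP_EOL;
echo $userPart, PHP_EOL;
```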
Returning Part of a String Based on Predefined Offsets
If start is positive, the returned string will begin at the start position of the string.
If start is negative, the returned string will begin that many characters from the end of the string (for example, a start of -4 begins at the fourth character from the end).
If length is provided and is positive, the returned string will consist of the characters between start and start + length. If this distance surpasses the total string length, only the string between start and the string’s end will be returned.
If length is provided and is negative, the returned string will end length characters from the end of str.
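The four rules above can be checked with a few calls to substr(). The sample string is hypothetical.

```php
<?php
// Hypothetical nine-character string.
$car = '1944 Ford';

// Positive start: begin at offset 5.
$fromFive = substr($car, 5);

// Negative start: begin four characters from the end.
$lastFour = substr($car, -4);

// Positive length: return four characters starting at offset 0.
$year = substr($car, 0, 4);

// Length beyond the end of the string: everything from start to the end.
$overshoot = substr($car, 5, 100);

// Negative length: stop five characters before the end.
$middle = substr($car, 2, -5);

echo $fromFive, ' ', $year, ' ', $middle, PHP_EOL;
```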
Determining the Frequency of a String’s Appearance
The optional offset and length parameters determine the string offset from which to begin attempting to match the substring within the string, and the maximum length of the string to search following the offset, respectively.
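The offset and length parameters narrow the portion of the string that is searched, as this sketch with a hypothetical sample string shows.

```php
<?php
// Hypothetical string containing "ho" four times.
$talk = 'ho ho ho, merry ho';

// Count across the whole string.
$all = substr_count($talk, 'ho');

// Start counting at offset 3, skipping the first "ho".
$fromOffset = substr_count($talk, 'ho', 3);

// Search only the five characters starting at offset 3 ("ho ho").
$window = substr_count($talk, 'ho', 3, 5);

echo "{$all} {$fromOffset} {$window}", PHP_EOL;
```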
Replacing a Portion of a String with Another String
If start is positive, replacement will begin at character start.
If start is negative, replacement will begin that many characters from the end of str.
If length is provided and is positive, replacement will be length characters long.
If length is provided and is negative, replacement will stop that many characters from the end of str.
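These rules can be sketched with substr_replace() as follows. The card number and the short sample string are hypothetical.

```php
<?php
// Hypothetical 16-digit number to mask.
$ccnumber = '1234567899991111';

// Positive start and length: mask the first twelve characters.
$masked = substr_replace($ccnumber, '************', 0, 12);

// Negative start, no length: replace from four characters before the end onward.
$tailMasked = substr_replace($ccnumber, '****', -4);

// Negative length: stop replacing one character before the end.
$inner = substr_replace('abcdef', 'X', 1, -1);

echo $masked, PHP_EOL;
echo $tailMasked, PHP_EOL;
```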
Padding and Stripping a String
For formatting reasons, you sometimes need to modify the string length via either padding or stripping characters. PHP provides a number of functions for doing so. This section examines many of the commonly used functions.