In addition to the previously described parsing functions in LIB_parse, PHP also contains a multitude of built-in parsing functions. The following is a brief sample of the most valuable built-in PHP parsing functions, along with examples of how they are used.
You can use the stristr() function to tell your webbot whether or not a string contains another string. The PHP community commonly uses the term haystack to refer to the entire unparsed text and the term needle to refer to the substring within the larger string. The function stristr() looks for the first occurrence of needle in haystack. If found, stristr() returns a substring of haystack from that occurrence of needle to the end of the larger string; if not found, it returns false. In normal use, you're not always concerned about the actual returned text. Generally, the fact that something was returned indicates that needle exists in haystack.
The stristr() function is handy if you want to detect whether or not a specific word is mentioned in a web page. For example, if you want to know if a web page mentions dogs, you can execute the script shown in Listing 4-11.
if(stristr($web_page, "dogs"))
    echo "This is a web page that mentions dogs.";
else
    echo "This web page does not mention dogs.";
Listing 4-11: Using stristr() to see if a string contains another string
In this example, we're not specifically interested in what the stristr() function returns, but whether it returns anything at all. If something is returned, we know that the web page contained the word dogs.
The stristr() function is not case sensitive. If you need a case-sensitive version of stristr(), use strstr().
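The difference between the two functions can be sketched as follows. The sample text and the helper name contains_word() are assumptions made for illustration. The helper compares the result against false with !==, which avoids a subtle problem: a returned substring of "0" would otherwise be treated as false by a plain if() test.

```php
<?php
// Hypothetical sample text standing in for a downloaded web page
$web_page = "Our shelter has many Dogs and cats.";

// Hypothetical helper: true if $needle occurs in $haystack
function contains_word($haystack, $needle, $case_sensitive = false)
{
    $found = $case_sensitive
        ? strstr($haystack, $needle)    // case-sensitive search
        : stristr($haystack, $needle);  // case-insensitive search

    // Both functions return false when the needle is absent
    return $found !== false;
}

var_dump(contains_word($web_page, "dogs"));        // bool(true)
var_dump(contains_word($web_page, "dogs", true));  // bool(false)
echo stristr($web_page, "dogs");                   // Dogs and cats.
```

The last line shows what stristr() actually returns: the remainder of the haystack, starting at the match.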
The PHP built-in function str_replace() puts a new string in place of all occurrences of a substring within a string, as shown in Listing 4-12.
$org_string = "I wish I had a Cat.";
$result_string = str_replace("Cat", "Dog", $org_string);
# $result_string contains "I wish I had a Dog."
Listing 4-12: Using str_replace() to replace all occurrences of Cat with Dog
The str_replace() function is also useful when a webbot needs to remove a character or set of characters from a string. You do this by instructing str_replace() to replace text with an empty string, as shown in Listing 4-13.
$result = str_replace('$', '', '$100.00');   // Remove the dollar sign
# $result contains 100.00
Listing 4-13: Using str_replace() to remove leading dollar signs
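str_replace() also accepts an array of search strings, so several unwanted characters can be removed in a single call. The following is a minimal sketch; the price string and variable names are made up for illustration.

```php
<?php
// Hypothetical price string as it might appear on a web page
$raw_price = '$1,250.00';

// Replace every "$" and "," with an empty string in one call
$clean_price = str_replace(array('$', ','), '', $raw_price);

echo $clean_price;   // 1250.00
```

Single quotes are used here so that PHP never attempts to interpret the dollar sign as the start of a variable name. Note that str_replace() removes every occurrence of each search string, not only the leading one.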
The script in Listing 4-14 uses a variety of built-in functions, along with a few functions from LIB_http and LIB_parse, to create a string that contains unformatted text from a website. The result is the contents of the web page without any HTML formatting.
include("LIB_parse.php");    # Include parse library
include("LIB_http.php");     # Include cURL library

// Download the page
$web_page = http_get($target="http://www.cnn.com", $referer="");

// Remove all JavaScript
$noformat = remove($web_page['FILE'], "<script", "</script>");

// Strip out all HTML formatting
$noformat = strip_tags($noformat);

// Remove unwanted white space
$noformat = str_replace("\t", "", $noformat);       // Remove tabs
$noformat = str_replace("&nbsp;", "", $noformat);   // Remove non-breaking spaces
$noformat = str_replace("\n", "", $noformat);       // Remove line feeds

echo $noformat;
Listing 4-14: Parsing the content from the HTML used on http://www.cnn.com
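Listing 4-14 depends on LIB_http and LIB_parse. As a rough alternative that relies only on PHP built-ins, the same idea can be sketched against a hard-coded HTML string. The helper name html_to_text() and the sample markup are assumptions, and the regular expression is a simplification that is not meant to handle every page. Unlike the listing, this sketch replaces whitespace with a single space rather than deleting it, so adjacent words do not run together.

```php
<?php
// Hypothetical HTML standing in for a downloaded page
$html = "<html><head><script>var x = 1;</script></head>"
      . "<body><h1>News</h1>\n\t<p>Top&nbsp;story today</p></body></html>";

// Hypothetical helper: reduce HTML to plain text using only built-ins
function html_to_text($html)
{
    // Drop <script>...</script> blocks, including their contents
    $text = preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);

    // Remove the remaining tags
    $text = strip_tags($text);

    // Turn the &nbsp; entity into an ordinary space
    $text = str_replace('&nbsp;', ' ', $text);

    // Collapse runs of spaces, tabs, and line feeds into one space
    $text = preg_replace('/\s+/', ' ', $text);

    return trim($text);
}

echo html_to_text($html);   // News Top story today
```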
Sometimes it is convenient to calculate the similarity of two strings without necessarily parsing them. PHP's similar_text() function returns the number of characters that match between two strings and, when you supply an optional third argument, stores the percentage of similarity in that variable. The syntax for using similar_text() is shown in Listing 4-15.
$matching_characters = similar_text($string1, $string2, $similarity_percentage);
Listing 4-15: Example of using PHP's similar_text() function
You may use similar_text() to determine if a new version of a web page is significantly different from a cached version.
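A change check of this kind might look like the following sketch. The two page strings and the 90 percent threshold are assumptions chosen for illustration; a suitable threshold depends on how much a page normally varies between downloads. The code reads the percentage from the third argument of similar_text(), since the return value itself is a count of matching characters.

```php
<?php
// Hypothetical cached and freshly downloaded versions of a page
$cached_page = "Today's headline: Markets rise on strong earnings.";
$new_page    = "Today's headline: Markets fall on weak earnings.";

// The return value is the number of matching characters;
// the percentage is written into the third argument
$matching_chars = similar_text($cached_page, $new_page, $percent);

// Hypothetical threshold: treat anything under 90% as a real change
$threshold = 90.0;

if ($percent < $threshold)
    echo "Page changed (" . round($percent, 1) . "% similar)";
else
    echo "Page essentially unchanged (" . round($percent, 1) . "% similar)";
```

Because similar_text() can be slow on long strings, a webbot might compare only the parsed text of interest rather than entire pages.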