Useful PHP Functions

In addition to the previously described parsing functions in LIB_parse, PHP also contains a multitude of built-in parsing functions. The following is a brief sample of the most valuable built-in PHP parsing functions, along with examples of how they are used.

Detecting Whether a String Is Within Another String

You can use the stristr() function to tell your webbot whether or not a string contains another string. The PHP community commonly uses the term haystack to refer to the entire unparsed text and the term needle to refer to the substring within the larger string. The function stristr() looks for an occurrence of needle in haystack. If needle is found, stristr() returns the portion of haystack that starts at the first occurrence of needle and runs to the end of the string; if needle is not found, it returns false. In normal use, you're rarely concerned with the returned text itself. Generally, the fact that something was returned is enough to tell you that needle exists in haystack.

The stristr() function is handy if you want to detect whether or not a specific word is mentioned in a web page. For example, if you want to know if a web page mentions dogs, you can execute the script shown in Listing 4-11.

if(stristr($web_page, "dogs"))
    echo "This is a web page that mentions dogs.";
else
    echo "This web page does not mention dogs.";

Listing 4-11: Using stristr() to see if a string contains another string

In this example, we're not specifically interested in what the stristr() function returns, but in whether it returns anything at all. If something is returned, we know that the web page contains the word dogs.

The stristr() function is not case sensitive. If you need a case-sensitive version of stristr(), use strstr().
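
The difference between the two functions, and the values they return, can be seen in a short sketch. The sample string below is invented for illustration, and the strict comparison against false is a defensive habit rather than something Listing 4-11 requires.

```php
<?php
// Hypothetical sample text standing in for a downloaded web page
$web_page = "Our shelter has many DOGS and cats available.";

// stristr() ignores case, so the needle "dogs" matches "DOGS"
$found_any_case = stristr($web_page, "dogs");

// strstr() is case sensitive, so the lowercase needle is not found
$found_exact_case = strstr($web_page, "dogs");

// Comparing against false avoids mistaking a falsy match (such as "0") for a miss
if ($found_any_case !== false)
    echo "Case-insensitive match: " . $found_any_case . "\n";
if ($found_exact_case === false)
    echo "No case-sensitive match.\n";
```

Here stristr() returns the text from the matched word to the end of the string, while strstr() returns false because the capitalization differs.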

Replacing a Portion of a String with Another String

The PHP built-in function str_replace() puts a new string in place of all occurrences of a substring within a string, as shown in Listing 4-12.

$org_string = "I wish I had a Cat.";
$result_string = str_replace("Cat", "Dog", $org_string);
# $result_string contains "I wish I had a Dog."

Listing 4-12: Using str_replace() to replace all occurrences of Cat with Dog

The str_replace() function is also useful when a webbot needs to remove a character or set of characters from a string. You do this by instructing str_replace() to replace the text with an empty string, as shown in Listing 4-13.

$result = str_replace('$', '', '$100.00');    // Remove the dollar sign (single quotes prevent variable interpolation)
# $result contains 100.00

Listing 4-13: Using str_replace() to remove dollar signs
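
When several different substrings need to be removed, str_replace() also accepts an array of needles and an optional fourth argument that receives the number of replacements made. The following is a minimal sketch of that usage; the price string is a made-up example, not data from any real page.

```php
<?php
// Hypothetical price string, as it might appear in scraped text
$raw_price = '$1,250.00 USD';

// An array of needles removes several substrings in one call;
// $count receives the total number of replacements performed
$clean_price = str_replace(array('$', ',', ' USD'), '', $raw_price, $count);

echo $clean_price . " (" . $count . " replacements)\n";
```

Single quotes are used around the strings so that PHP never tries to interpret the dollar sign as the start of a variable name.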

Parsing Unformatted Text

The script in Listing 4-14 uses a variety of built-in functions, along with a few functions from LIB_http and LIB_parse, to create a string that contains unformatted text from a website. The result is the contents of the web page without any HTML formatting.

include("LIB_parse.php");    # Include parse library
include("LIB_http.php");     # Include cURL library

// Download the page
$web_page = http_get($target="http://www.cnn.com", $referer="");

// Remove all JavaScript
$noformat = remove($web_page['FILE'], "<script", "</script>");
// Strip out all HTML formatting
$noformat = strip_tags($noformat);

// Remove unwanted white space
$noformat = str_replace("\t", "", $noformat);     // Remove tabs
$noformat = str_replace("&nbsp;", "", $noformat); // Remove non-breaking spaces
$noformat = str_replace("\n", "", $noformat);     // Remove line feeds
echo $noformat;

Listing 4-14: Parsing the content from the HTML used on http://www.cnn.com
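
To experiment with the same sequence of steps without downloading a page or loading the libraries, a self-contained sketch along the following lines may help. The HTML fragment is hypothetical, and a regular expression stands in for the remove() function from LIB_parse, so treat this as an approximation of the listing rather than an equivalent of it.

```php
<?php
// Hypothetical HTML fragment standing in for a downloaded page
$html = "<html><head><script type=\"text/javascript\">var x = 1;</script></head>\n"
      . "<body><h1>Headline</h1>\t<p>Story text&nbsp;</p></body></html>";

// Remove script blocks (an assumption: preg_replace() used in place of remove())
$noformat = preg_replace('/<script\b[^>]*>.*?<\/script>/is', '', $html);

// Strip out the remaining HTML tags
$noformat = strip_tags($noformat);

// Remove tabs, non-breaking space entities, and line feeds in one call
$noformat = str_replace(array("\t", "&nbsp;", "\n"), "", $noformat);

echo $noformat;
```

Removing the scripts before calling strip_tags() matters because strip_tags() deletes only the tags themselves and would otherwise leave the JavaScript source in the text.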

Measuring the Similarity of Strings

Sometimes it is convenient to calculate the similarity of two strings without necessarily parsing them. PHP's similar_text() function returns the number of characters that two strings have in common and, when given an optional third argument, stores the percentage of similarity between the two strings in that argument. The syntax for using similar_text() is shown in Listing 4-15.

$matching_characters = similar_text($string1, $string2, $similarity_percentage);

Listing 4-15: Example of using PHP's similar_text() function

You may use similar_text() to determine whether a new version of a web page is significantly different from a cached version.
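
A comparison of that kind might look like the sketch below. Both page strings and the 90 percent threshold are assumptions chosen for illustration; a suitable threshold depends on how much a given page normally varies between downloads.

```php
<?php
// Hypothetical cached and freshly downloaded versions of the same page text
$cached_page = "Breaking news: markets close higher today.";
$new_page    = "Breaking news: markets close lower today.";

// similar_text() returns the count of matching characters and
// writes the percentage of similarity into the third argument
$matching_chars = similar_text($cached_page, $new_page, $percent);

// Assumed threshold: treat anything under 90 percent as a significant change
$threshold = 90.0;
if ($percent < $threshold)
    echo "The page changed significantly.\n";
else
    echo "The page is " . round($percent, 1) . "% similar to the cached copy.\n";
```

Because similar_text() can be slow on long strings, it may be worth reducing a page to unformatted text before comparing it with the cached copy.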
