Quick start - developing a bot

Now that we have covered the development environment, we can get to the fun material, that is, programming our first bot. To begin with, we will develop an HTTP package, which we can use in our bot applications to handle HTTP requests and responses.

Step 1 – HTTP request classes

Now that we have discussed our development environment, coding standards, and how basic HTTP requests and responses work, let's create an HTTP request class that actually does something useful. Again, if you are not familiar with developing PHP classes and using PHP objects, you will want to research these topics before reading this section. While this book is a starter book for developing PHP bots, I want to teach you the correct way to develop bots, which means using the power and reusability of classes and objects, also known as OOP (Object-Oriented Programming).

When we use well-designed classes, we can use them as objects in the project we are currently developing, and future projects that require the same functionality and logic can reuse them as well. If you desire to be a productive and successful programmer, you will eventually need to accept this point of view, if you haven't already.

In this section, I am going to help you develop a basic HTTP request class which we will be able to use in all of our bots that need this type of tool. Later, we will create an HTTP response class that will easily allow our request class to return objects instead of arrays of information, which will be very useful.

Before you start developing a class, you should always spend some time designing it first. This way you won't have to think as much while you are developing the class. So, let's spend a little time thinking about the design of our HTTP request class.

Here are some requirements we will need in our HTTP request class:

  • We will need methods for GET and HEAD HTTP requests
  • The request methods should be simple, we only want to pass a URL and a timeout value to the GET and HEAD methods
  • We want the request methods to return HTTP response objects, not arrays or scalar values

These are some good requirements for a basic HTTP request class. Obviously, we could go much further with the design of our class and add other advanced options and methods such as debugging mode and logging. However, for the project we are developing, these requirements will do just fine.

Before we start developing the HTTP request class, let's set up a project directory structure through the following steps. You will be able to use this directory structure for all the projects in this book.

  1. Create the following directory structure and files (the files can be empty for now) in a desired location on your web server:
    project_directory
    |-- lib
    |   '-- HTTP
    |       |-- Request.php
    |       '-- Response.php
    |-- 01_command_line_app.php (from command line application example)
    '-- 02_http.php
  2. Now, place all our class files (library files) in the project_directory/lib directory.
  3. Open the Request.php file.
  4. Add the first few lines of code for our Request class located at project_directory/lib/HTTP/:
    <?php
    namespace HTTP;
    
    /**
     * HTTP Request class - execute HTTP GET and HEAD requests
     *
     * @package HTTP
     */
    class Request
    {

First, in our Request.php file we set the namespace to HTTP. This is a simple way of telling PHP that we want the Request class in the HTTP container. If you are not familiar with namespaces, you can research the topic further. However, for now, you can think of a namespace as a container. Every class, function, method, or constant in the namespace (or container) HTTP should have something to do with HTTP logic. Likewise, if we had the namespace Database, everything in that namespace would include logic and methods for database functionality.

Next, we leave some simple, yet helpful, comments that will allow any developer to quickly determine what the class is used for. And finally, we declare our class name as Request. One important thing to note here is that this code will not work on any PHP Version prior to PHP 5.3, because PHP didn't include namespace syntax until PHP 5.3.
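
To make the container idea concrete, here is a minimal sketch. The two Request classes and their describe() method are hypothetical stand-ins, not the class we are building, and the sketch puts two namespaces in one file only so that it runs on its own:

```php
<?php
namespace HTTP;

// Hypothetical stand-in for the Request class built in this section.
class Request
{
    public static function describe()
    {
        return __CLASS__; // class name including its namespace
    }
}

namespace Database;

// A different namespace can reuse the same short class name without conflict.
class Request
{
    public static function describe()
    {
        return __CLASS__;
    }
}

// We are still inside the Database namespace at this point.
echo Request::describe(), "\n";       // unqualified name resolves to Database\Request
echo \HTTP\Request::describe(), "\n"; // fully qualified name reaches HTTP\Request
```

The leading backslash in \HTTP\Request tells PHP to resolve the name from the global namespace, which is how code outside the HTTP namespace refers to our class.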

The first class method we are going to add to our Request class is a method that can format a timeout value if a timeout value has been used, or return a default timeout value if no timeout value has been used. This method will make more sense later on when we develop the request methods. Here are the next lines of code in our Request class, used for the formatTimeout method:

Insert the following snippet of code to our Request class located at project_directory/lib/HTTP/:

/**
 * Format timeout in seconds, if no timeout use default timeout
 *
 * @param int|float $timeout (seconds)
 * @return int|float
 */
private static function __formatTimeout($timeout = 0)
{
  $timeout = (float)$timeout; // format timeout value

  if($timeout < 0.1)
  {
    $timeout = 60; // default timeout
  }

  return $timeout;
}

As you can see, this is a very simple method that takes an optional parameter called timeout and formats the timeout value so that it can be used properly in our request methods. If no timeout value is passed to the method, it will return the default timeout value (60 seconds).
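
If you want to try the timeout logic outside the class, the following sketch copies it into a free function. The name format_timeout_sketch is made up for this illustration:

```php
<?php
// Standalone copy of the timeout logic so it can be tried without the class.
// The function name is hypothetical.
function format_timeout_sketch($timeout = 0)
{
    $timeout = (float)$timeout; // numeric strings such as '2.5' become floats

    if ($timeout < 0.1) {
        $timeout = 60; // same default as the class method
    }

    return $timeout;
}

var_dump(format_timeout_sketch());      // default is used
var_dump(format_timeout_sketch(5));     // cast to float
var_dump(format_timeout_sketch('2.5')); // numeric string is accepted
var_dump(format_timeout_sketch(-3));    // negative values fall back to the default
```

Note that the default comes back as the integer 60 while every other value comes back as a float, which is why the doc block declares int|float.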

Next, we need to create a method that can parse the raw HTTP response that we receive from our request methods. Once the raw HTTP response has been parsed, we can use the response parts with our HTTP Response class (which we'll build later) to form a usable response object. Here is the parse response method:

Insert the following snippet of code to our Request class:

    /**
     * Parse HTTP response
     *
     * @param string $body
     * @param array $header
     * @return \HTTP\Response
     */
    private static function __parseResponse($body, $header)
    {
        $status_code = 0;
        $content_type = '';

        if(is_array($header) && count($header) > 0)
        {
            foreach($header as $v)
            {
                // ex: HTTP/1.x XYZ Message
                if(substr($v, 0, 4) == 'HTTP' 
                    && strpos($v, ' ') !== false) 
                {
                    $status_code = (int)substr($v, 
                        strpos($v, ' '), 4); // parse status code
                }
                // ex: Content-Type: *; charset=*
                else if(strncasecmp($v, 'Content-Type:', 13) === 0) 
                {
                    $content_type = $v;
                }
            }
        }

        return new \HTTP\Response($status_code,
            $content_type, $body, $header);
    }

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

This parse method takes two arguments: the body (string) and the header (array). In the method, we first initialize the variables for status code and content type. These variables will be used in the HTTP Response object, which is returned by the method. The next part of the method loops through the HTTP header and parses out the status code (HTTP response code) and the content type (MIME type). These variables are also used when creating the HTTP Response object. And finally, we instantiate the HTTP Response object (which we build in the next section) and return it. This method will make more sense once we have finished creating the entire class, but we needed to include it first so that our request methods can utilize it.
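
To see the header loop at work without the rest of the package, here is a small sketch. The function parse_header_sketch is hypothetical; it repeats the same parsing steps but returns a plain array rather than a Response object, and the sample header lines are invented:

```php
<?php
// Hypothetical helper that mirrors the header loop and returns an array,
// so it runs without the Response class.
function parse_header_sketch(array $header)
{
    $status_code = 0;
    $content_type = '';

    foreach ($header as $line) {
        if (substr($line, 0, 4) == 'HTTP' && strpos($line, ' ') !== false) {
            // take the 4 characters starting at the first space, e.g. " 404"
            $status_code = (int)substr($line, strpos($line, ' '), 4);
        } elseif (strncasecmp($line, 'Content-Type:', 13) === 0) {
            $content_type = $line; // the full header line is kept
        }
    }

    return ['status' => $status_code, 'type' => $content_type];
}

// Invented sample header for illustration.
$sample = [
    'HTTP/1.1 404 Not Found',
    'Content-Type: text/html; charset=UTF-8',
];

print_r(parse_header_sketch($sample));
```

The integer cast ignores the leading space in " 404", which is why the method can cut the substring starting at the space itself.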

Now, let's build the first request method of our Request class—the get() request method. The get() method will be the most common request method that we will use in our bots. Here is the code for the get() method:

Insert the following snippet of code to our Request class:

    /**
     * Execute HTTP GET request
     *
     * @param string $url
     * @param int|float $timeout (seconds)
     * @return \HTTP\Response
     */
    public static function get($url, $timeout = 0)
    {
        $context = stream_context_create();
        stream_context_set_option($context, 'http', 'timeout', self::__formatTimeout($timeout));

        $http_response_header = NULL; // allow updating

        $res_body = file_get_contents($url, false, $context);

        return self::__parseResponse($res_body, $http_response_header);
    }

The preceding method is very simple. Through the parameters, we tell the get() method which URL to send the GET request to, and we can pass an optional timeout value. Our get() method then creates the required context for the HTTP GET request, sends the request, and creates and returns an HTTP\Response object that we can use.

Next, we can build the head() method. An HTTP HEAD request works like a GET request, except that the response does not include a message body. This makes it useful for lightweight checks, such as pinging an HTTP service on a web server.

Here is the code for the HEAD request. Insert the following snippet of code to our Request class located at project_directory/lib/HTTP/:

    /**
     * Execute HTTP HEAD request
     *
     * @param string $url
     * @param int|float $timeout
     * @return \HTTP\Response
     */
    public static function head($url, $timeout = 0)
    {
        $context = stream_context_create();

        stream_context_set_option($context, [
            'http' => [
                'method' => 'HEAD',
                'timeout' => self::__formatTimeout($timeout)
            ]
        ]);

        $http_response_header = NULL; // allow updating

        $res_body = file_get_contents($url, false, $context);

        return self::__parseResponse($res_body, $http_response_header);
    }

The preceding method is very straightforward. First, we accept the URL and timeout values. Then, we configure a custom stream context, which is used when we send the HTTP HEAD request. Then, we use the built-in PHP function file_get_contents() to send the HTTP HEAD request. Finally, we return the HTTP\Response object, which is created in our __parseResponse() method.
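
If you are curious about what a stream context actually holds, you can read the options back with the built-in stream_context_get_options() function. The following sketch needs no network access; the 30-second timeout is an arbitrary value chosen for illustration:

```php
<?php
// Build the same kind of context that head() uses and inspect it.
$context = stream_context_create();

// One option per call: wrapper name, option name, value.
stream_context_set_option($context, 'http', 'method', 'HEAD');
stream_context_set_option($context, 'http', 'timeout', 30.0); // arbitrary example value

// Read the options back as a nested array keyed by wrapper name.
$options = stream_context_get_options($context);
print_r($options);
```

The options are grouped under the 'http' wrapper key, which is the same nested shape we pass as an array in the head() method.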

Step 2 – HTTP response class

Now, we need to create an HTTP Response class that can be used to easily transform our HTTP response data into a useable object. This way, once we receive an HTTP response in our Request class, we simply pass the response data to our Response object and we don't have to think about all the details and methods required to use the HTTP response data in our applications.

Create a file called Response.php in the HTTP package directory and add the code available from the book source code download file project_directory/lib/HTTP/Response.php at Packt Publishing's website.

As you can see, it takes quite a few lines of code to create our HTTP\Response class; however, the class is very simple.

First, we define HTTP response status codes and status code messages. Status codes are used by web servers to denote the response status. For example, if an HTTP GET request is sent to a web server and the web server returns an HTTP response status code of 200, we know that the request has been handled successfully. You can see, by looking at the status codes and status code messages in the Response class, that there is a variety of status codes and messages with which the web server can respond.

Next, in the __construct() method of the Response class, we accept the status code, type (content type), body, and header. These are the only parameters we need to instantiate and initialize a Response object so that it works properly for us.
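
Because the full class lives in the download file, here is a heavily reduced sketch of the same idea. The class name ResponseSketch, its three status messages, and all method names other than getStatusCode() are my own assumptions for illustration; this is not the code from the download:

```php
<?php
// Hypothetical, reduced stand-in for the Response class.
// The real class defines many more status codes and methods.
class ResponseSketch
{
    private static $messages = [
        200 => 'OK',
        404 => 'Not Found',
        500 => 'Internal Server Error',
    ];

    private $status;
    private $statusMessage;
    private $mime = '';
    private $body;
    private $header;

    public $success = false;

    public function __construct($status_code, $type, $body, $header)
    {
        $this->status = (int)$status_code;
        $this->statusMessage = isset(self::$messages[$this->status])
            ? self::$messages[$this->status]
            : 'Unknown';
        $this->body = $body;
        $this->header = $header;

        // "Content-Type: text/html; charset=UTF-8" -> "text/html"
        if (preg_match('/^Content-Type:\s*([^;\s]+)/i', (string)$type, $m)) {
            $this->mime = $m[1];
        }

        // assumption: only status code 200 counts as a success
        $this->success = ($this->status === 200);
    }

    public function getStatusCode() { return $this->status; }
    public function getStatusMessage() { return $this->statusMessage; }
    public function getMimeType() { return $this->mime; }
    public function getBody() { return $this->body; }
    public function getHeader() { return $this->header; }
}

$response = new ResponseSketch(200, 'Content-Type: text/html; charset=UTF-8', '<html></html>', []);
echo $response->getStatusCode(), ' ', $response->getStatusMessage(), ' ', $response->getMimeType(), "\n";
```

The point of the sketch is the shape: the constructor receives the four raw parts once, and every later question about the response is answered by a property or a method.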

Why use objects?

Now that we have our HTTP\Request and HTTP\Response classes completed, let's take a look at why we are developing classes and using objects (Object-Oriented Programming or OOP) instead of using procedural lines of code (Procedure-Oriented Programming or POP). To illustrate this point, I am going to have you execute your first HTTP HEAD request using our HTTP classes.

Create a file called 02_http.php in your project directory, where it will have access to the /project_directory/lib/HTTP directory. Add the following code to the 02_http.php file located at /project_directory/:

<?php
/**
 * Example HTTP HEAD request
 */

// include our classes
require_once './lib/HTTP/Request.php';
require_once './lib/HTTP/Response.php';

// execute example HTTP HEAD request
$response = \HTTP\Request::head('http://www.google.com');

// print out HTTP response (HTTP\Response object)
echo '<pre>' . print_r($response, true) . '</pre>';

In this code, first we include the Request and Response classes that we have developed. Next, we set the response variable to the response (HTTP\Response) object that is created by the HTTP\Request::head() method. Finally, we print the HTTP Response object for illustration, or debugging/testing purposes. If you execute this code in a web browser, you will see something like the following:

HTTP\Response Object
(
    [__body:HTTP\Response:private] => 
    [__encoding:HTTP\Response:private] => ISO-8859-1
    [__header:HTTP\Response:private] => Array
        (
            [0] => HTTP/1.0 200 OK
            [1] => Date: (date/time) GMT
            [2] => Expires: -1
            [3] => Cache-Control: private, max-age=0
            [4] => Content-Type: text/html; charset=ISO-8859-1
            [5] => Set-Cookie: ***
            [6] => Set-Cookie: ***
            [7] => P3P: ***
            [8] => Server: gws
            [9] => X-XSS-Protection: 1; mode=block
            [10] => X-Frame-Options: SAMEORIGIN
        )

    [__mime:HTTP\Response:private] => text/html
    [__status:HTTP\Response:private] => 200
    [__status_message:HTTP\Response:private] => OK
    [success] => 1
)

Success! We have successfully executed an HTTP HEAD request, received a response, parsed it, created an HTTP Response object, and printed the object. Now, we could easily use the object for more useful things; for example, change the 02_http.php file located at /project_directory/ as follows:

// display response status
if($response->success)
{
    echo 'Successful request <br />';
}
else
{
    echo 'Error: request failed, status code: ' 
    . $response->getStatusCode() . '<br />'; // prints status code
}

If we were using procedural programming (POP) instead of classes and objects (OOP), it would be much more difficult to do this. Using classes with namespaces also makes it easy for us to use them in other application frameworks without class naming conflicts. In addition, this approach to programming makes it much easier to determine what type of logic and purpose a class is designed for. For example, it would be easy for another programmer to conclude that the HTTP\Request class is used to generate HTTP requests.

Step 3 – using bootstrap files

Now that we have our HTTP package—set by the namespace HTTP—completed, we can easily use it for other projects and applications. Sometimes, especially when using large packages or library files, it is hard to remember or find out what exactly needs to take place in order for us to get a package ready for use in our own software applications. A package might require extensive configuration settings, class autoloading, common file loading, external package or library files, and more.

A simple solution to this problem is using a bootstrap file. A bootstrap file can be used to initialize everything the package requires to load and initialize properly. Our HTTP package doesn't require much file loading or any configuration settings, but for the sake of example, let's create a simple bootstrap file for our HTTP package:

  1. Create a file called bootstrap.php in the HTTP package directory.
  2. Add the following code to our bootstrap file located at project_directory/lib/HTTP/:
    <?php
    namespace HTTP;
    
    /**
     * Bootstrap file
     *
     * @package HTTP
     */
    
    // load class files
    require_once './lib/HTTP/Request.php';
    require_once './lib/HTTP/Response.php';

    The preceding code is a very simple bootstrap file. We are simply loading the classes required in our HTTP package. However, it will make the use of our HTTP package even easier!

  3. Now, in our 02_http.php file, modify the code to use our bootstrap.php file instead of loading the class files manually:
    <?php
    /**
     * Example HTTP GET request
     */
    
    // load HTTP package with bootstrap file
    require_once './lib/HTTP/bootstrap.php';
    
    // execute example HTTP GET request
    $response = \HTTP\Request::get('http://www.google.com');
    
    // display response status
    if($response->success)
    {
        echo 'Successful request <br />';
    }
    else
    {
        echo 'Error: request failed, status code: ' 
        . $response->getStatusCode() . '<br />'; // prints  status code
    }
    
    // print out HTTP response (HTTPResponse object)
    echo '<pre>' . print_r($response, true) . '</pre>';

Although this is an extremely simple example of how a bootstrap file can be used to make the use of a package easier, it is still beneficial to any programmer who uses our code. Also, in later sections of this book we will be using bootstrap files when we develop our bot package.
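
Class autoloading was mentioned earlier as one of the jobs a bootstrap file can take on. Our HTTP package does not need it, but for larger packages a bootstrap file might register an autoloader instead of listing every class file. The following is only a sketch under my own assumptions: the Demo\Greeter class and the temporary directory are invented so that the example runs on its own.

```php
<?php
// Hypothetical demo: build a throwaway library directory with one class file.
$lib = sys_get_temp_dir() . '/bootstrap_sketch_' . uniqid();
mkdir($lib . '/Demo', 0777, true);
file_put_contents(
    $lib . '/Demo/Greeter.php',
    "<?php\nnamespace Demo;\n\nclass Greeter\n{\n"
    . "    public function greet() { return 'hello'; }\n}\n"
);

// Map a class name such as Demo\Greeter to <lib>/Demo/Greeter.php.
spl_autoload_register(function ($class) use ($lib) {
    $file = $lib . '/' . str_replace('\\', '/', $class) . '.php';
    if (is_file($file)) {
        require_once $file;
    }
});

// No require_once for Greeter.php here: the autoloader finds the file.
$greeter = new \Demo\Greeter();
echo $greeter->greet(), "\n";
```

This works because our directory layout follows the namespace names, which is one more reason to keep each package in a directory named after its namespace.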

Step 4 – creating our first bot, WebBot

At this point in the book, you should be aware of and comfortable with HTTP requests and responses, how to develop HTTP packages (covered earlier in the book), and why we use bootstrap files.

With the knowledge you have gained, we are now ready to develop our first bot, which will be a simple bot that gathers data (documents) based on a list of URLs and datasets (field and field values) that we will require.

First, let's start by creating our bot package directory. So, create a directory called WebBot so that the files in our project_directory/lib directory look like the following:

project_directory
|-- lib
|   |-- HTTP (our existing HTTP package)
|   |   '-- (HTTP package files here)
|   '-- WebBot
|       |-- bootstrap.php
|       |-- Document.php
|       '-- WebBot.php
|-- (our other files)
'-- 03_webbot.php

As you can see, we have a very clean and simple directory and file structure that any programmer should be able to easily follow and understand.

Step 5 – the WebBot class

Next, open the WebBot.php file and add the code from the book source code download file project_directory/lib/WebBot/WebBot.php at Packt Publishing's website.

In our WebBot class, we first use the __construct() method to pass in the array of URLs (or documents) we want to fetch and the array of document fields, which defines the datasets and their regular expression patterns. Regular expression patterns are used to populate the dataset values (or document field values). If you are unfamiliar with regular expressions, now would be a good time to study them. Then, in the __construct() method, we verify whether there are URLs to fetch or not. If there are none, we set an error message stating this problem.

Next, we use the __formatUrl() method to properly format the URLs we fetch data from. This method also sets the correct protocol: either HTTP or HTTPS (Hypertext Transfer Protocol Secure). If the protocol is already set for the URL, for example http://www.[dom].com, we skip setting the protocol. Also, if the class configuration setting conf_force_https is set to true, we force the HTTPS protocol, again unless the protocol is already set for the URL.
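
The behaviour just described can be approximated in a few lines. This is a sketch only: format_url_sketch is a hypothetical free function, and the real __formatUrl() method in the download file may differ in its details.

```php
<?php
// Hypothetical free function approximating the described behaviour.
function format_url_sketch($url, $force_https = false)
{
    $url = trim($url);

    // leave the URL alone when it already carries a protocol
    if (preg_match('#^https?://#i', $url)) {
        return $url;
    }

    return ($force_https ? 'https://' : 'http://') . $url;
}

echo format_url_sketch('www.example.com'), "\n";              // http://www.example.com
echo format_url_sketch('www.example.com', true), "\n";        // https://www.example.com
echo format_url_sketch('http://www.example.com', true), "\n"; // http://www.example.com
```

The third call shows the rule that an existing protocol always wins, even when HTTPS is being forced.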

We then use the execute() method to fetch data for each URL, set and add the Document objects to the array of documents, and track document statistics. This method also implements fetch delay logic that will delay each fetch by x number of seconds if set in the class configuration setting conf_delay_between_fetches. We also include the logic that only allows distinct URL fetches, meaning that, if we have already fetched data for a URL we won't fetch it again; this eliminates duplicate URL data fetches. The Document object is used as a container for the URL data, and we can use the Document object to access the URL data, the data fields, and their corresponding data field values.
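
The distinct-URL and fetch-delay logic can be sketched as follows. Everything here is hypothetical: fetch_all_sketch is not a method of the WebBot class, and the fetcher is a plain callable that stands in for the HTTP request so that the example runs offline.

```php
<?php
// Hypothetical sketch: $fetch is any callable that takes a URL and returns data.
function fetch_all_sketch(array $urls, $fetch, $delay_seconds = 0)
{
    $documents = [];
    $seen = [];

    foreach ($urls as $id => $url) {
        if (isset($seen[$url])) {
            continue; // distinct URLs only: skip one we already fetched
        }
        $seen[$url] = true;

        if ($delay_seconds > 0 && count($documents) > 0) {
            sleep($delay_seconds); // pause between fetches, not before the first
        }

        $documents[$id] = call_user_func($fetch, $url);
    }

    return $documents;
}

// Fake fetcher used in place of a real HTTP request.
$fake_fetch = function ($url) {
    return 'data for ' . $url;
};

$docs = fetch_all_sketch(
    [
        'a' => 'http://example.com/',
        'b' => 'http://example.com/', // same URL as 'a', so it is skipped
        'c' => 'http://example.org/',
    ],
    $fake_fetch
);

print_r(array_keys($docs));
```

Passing the fetcher in as a callable is a choice made for this sketch; it keeps the loop testable without touching the network.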

In the execute() method, you can see that we perform an HTTP\Request::get() request using the URL and our default timeout value, which is set with the class configuration setting conf_default_timeout. We then pass the HTTP\Response object that is returned by the HTTP\Request::get() method to the Document object. Then, the Document object uses the data from the HTTP\Response object to build the document data.

Finally, we include the getDocuments() method, which simply returns all the Document objects in an array that we can use for our own purposes as we desire.

Step 6 – the WebBot Document class

Next, we need to create a class called Document that can be used to store document data and field names with their values. To do this we will carry out the following steps:

  1. We first pass the data retrieved by our WebBot class to the Document class.
  2. Then, we define our document's fields and values using regular expression patterns.
  3. Next, add the code from the book source code download file project_directory/lib/WebBot/Document.php at Packt Publishing's website.

    Our Document class accepts the HTTP\Response object that is set in the WebBot class's execute() method, along with the document fields and document ID.

  4. In the Document __construct() method, we set our class properties: the HTTP Response object, the fields (and regular expression patterns), the document ID, and the URL that we use to fetch the HTTP response.
  5. We then check if the HTTP response was successful (status code 200), and if it wasn't, we set the error with the status code and message.
  6. Lastly, we call the __setFields() method.

The __setFields() method parses out and sets the field values from the HTTP response body. For example, if in our fields we have a title field defined as $fields = ['title' => '<title>(.*)</title>'];, the __setFields() method will add the title field and pull all values inside the <title>*</title> tags from the HTTP response body. So, if there were two title tags in the URL data, the __setFields() method would add the field and its values to the document as follows:

['title'] => [
    0 => 'title x',
    1 => 'title y'
]

If we have the WebBot class configuration variable—conf_include_document_field_raw_values—set to true, the method will also add the raw values (it will include the tags or other strings as defined in the field's regular expression patterns) as a separate element, for example:

['title'] => [
    0 => 'title x', 
    1 => 'title y', 
    'raw' => [
        0 => '<title>title x</title>',
        1 => '<title>title y</title>'
    ]
]

The preceding code is very useful when we want to extract specific data (or field values) from URL data.
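
If you would like to experiment with this kind of extraction before opening the download file, here is a sketch built on the built-in preg_match_all() function. The function name, the # delimiter, and the i, s, and U pattern flags are my assumptions; the real __setFields() method may wrap the patterns differently. The sketch also assumes that each pattern contains exactly one capture group and no # character.

```php
<?php
// Hypothetical sketch of field extraction with preg_match_all().
function extract_fields_sketch($body, array $fields, $include_raw = false)
{
    $result = [];

    foreach ($fields as $name => $pattern) {
        $result[$name] = [];

        // $matches[0] holds the full matches, $matches[1] the first capture group
        if (preg_match_all('#' . $pattern . '#isU', $body, $matches) > 0) {
            $result[$name] = $matches[1];

            if ($include_raw) {
                $result[$name]['raw'] = $matches[0];
            }
        }
    }

    return $result;
}

$body = "<title>title x</title>\n<title>title y</title>";
$fields = ['title' => '<title>(.*)</title>'];

print_r(extract_fields_sketch($body, $fields, true));
```

The U flag makes the (.*) group stop at the first closing tag; without it, a greedy match combined with the s flag could swallow everything between the first opening tag and the last closing tag.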

To conclude the Document class, we have two more methods as follows:

  • getFields(): This method simply returns the fields and field values
  • getHttpResponse(): This method can be used to get the HTTP\Response object that was originally set by the WebBot execute() method

These methods give us access to the document's parsed fields and to the underlying response object whenever we need them.

Step 7 – the WebBot bootstrap file

Now we will create a bootstrap.php file (at project_directory/lib/WebBot/) to load the HTTP package and our WebBot package classes, and set our WebBot class configuration settings:

<?php
namespace WebBot;

/**
 * Bootstrap file
 *
 * @package WebBot
 */

// load our HTTP package
require_once './lib/HTTP/bootstrap.php';

// load our WebBot package classes
require_once './lib/WebBot/Document.php';
require_once './lib/WebBot/WebBot.php';

// set unlimited execution time
set_time_limit(0);

// set default timeout to 30 seconds
\WebBot\WebBot::$conf_default_timeout = 30;

// set delay between fetches to 1 second
\WebBot\WebBot::$conf_delay_between_fetches = 1;

// do not use HTTPS protocol (we'll use HTTP protocol)
\WebBot\WebBot::$conf_force_https = false;

// do not include document field raw values
\WebBot\WebBot::$conf_include_document_field_raw_values = false;

We use our HTTP package to handle HTTP requests and responses. In the previous two sections, you saw how our WebBot class uses HTTP requests to fetch the data and how the HTTP Response object stores the fetched data. That is why we need to include the bootstrap file to load the HTTP package properly.

Then, we load our WebBot package files. Because our WebBot class uses the Document class, we load that class file first.

Next, we use the built-in PHP function set_time_limit() to tell the PHP interpreter that we want to allow unlimited execution time for our script. You don't necessarily have to use unlimited execution time. However, for testing reasons, we will use unlimited execution time for this example.

Finally, we set the WebBot class configuration settings. These settings are used by the WebBot object internally to make our bot work as we desire. We should always make the configuration settings as simple as possible to help other developers understand. This means we should also include detailed comments in our code to ensure easy usage of package configuration settings.

We have set up four configuration settings in our WebBot class. These are static and public variables, meaning that we can set them from anywhere after we have included the WebBot class, and once we set them they will remain the same for all WebBot objects unless we change the configuration variables. If you do not understand the PHP keyword static, now would be a good time to research this subject.

  • The first configuration variable is conf_default_timeout. This variable is used to globally set the default timeout (in seconds) for all WebBot objects we create. The timeout value tells the HTTP\Request class how long it should continue trying to send a request before stopping and deeming it a bad request, or a timed-out request. By default, this configuration setting value is set to 30 (seconds).
  • The second configuration variable, conf_delay_between_fetches, is used to set a time delay (in seconds) between fetches (or HTTP requests). This can be very useful when gathering a lot of data from a website or web service. For example, say you had to fetch one million documents from a website. You wouldn't want to unleash your bot on that type of mission without fetch delays, because the flood of requests could cause problems for that website. By default, this value is set to 0, or no delay.
  • The third WebBot class configuration variable, conf_force_https, when set to true, can be used to force the HTTPS protocol. As mentioned earlier, this will not override any protocol that is already set in the URL. If the conf_force_https variable is set to false, the HTTP protocol will be used. By default, this value is set to false.
  • The fourth and final configuration variable, conf_include_document_field_raw_values, when set to true, will force the Document object to include the raw values gathered from the fields' regular expression patterns. We discussed this setting in detail in the WebBot Document class section earlier in this book. By default, this value is set to false.
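
If the static keyword is new to you, the following sketch shows why one assignment affects every object. The class BotSketch is hypothetical and only borrows the name of one configuration variable:

```php
<?php
// Hypothetical class showing how a public static setting is shared.
class BotSketch
{
    public static $conf_default_timeout = 30;

    public function getTimeout()
    {
        return self::$conf_default_timeout; // read the class-wide value
    }
}

$first = new BotSketch();
$second = new BotSketch();

// Set once, on the class itself, after both objects already exist.
BotSketch::$conf_default_timeout = 10;

echo $first->getTimeout(), ' ', $second->getTimeout(), "\n"; // both objects see 10
```

A static property belongs to the class rather than to any single object, so the bootstrap file can configure every WebBot object in one place.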

Step 8 – the WebBot execution

Now that we have our WebBot class, WebBot Document class, and WebBot bootstrap file completed, we can start testing our bot. Add the following code to the 03_webbot.php file located at project_directory/:

<?php
/**
 * WebBot example
 */

// load WebBot library with bootstrap
require_once './lib/WebBot/bootstrap.php';

// URLs to fetch data from
$urls = [
    'search' => 'www.google.com',
    'chrome' => 'www.google.com/intl/en/chrome/browser/',
    'products' => 'www.google.com/intl/en/about/products/'
];

// document fields [document field ID => document field regex pattern, [...]]
$document_fields = [
    'title' => '<title.*>(.*)</title>',
    'h2' => '<h2[^>]*?>(.*)</h2>',
];

// set WebBot object
$webbot = new \WebBot\WebBot($urls, $document_fields);

// execute fetch data from URLs
$webbot->execute();

// display documents summary
echo $webbot->total_documents . ' total documents <br />';
echo $webbot->total_documents_success . ' total documents fetched successfully <br />';
echo $webbot->total_documents_failed . ' total documents failed to fetch <br /><br />';


// check if fetch(es) successful
if($webbot->success)
{
    // display each document
    foreach($webbot->getDocuments() as /* \WebBot\Document */ $document)
    {
        if($document->success) // was document data fetched successfully?
        {
            // display document meta data
            echo 'Document: ' . $document->id . '<br />';
            echo 'URL: ' . $document->url . '<br />';

            // display/print document fields and values
            $fields = $document->getFields();
            echo '<pre>' . print_r($fields, true) . '</pre>';
        }
        else // failed to fetch document data, display error
        {
            echo 'Document error: ' . $document->error . '<br />';
        }
    }
}
else // not successful, display error
{
    echo 'Failed, error: ' . $webbot->error;
}

First, we load our WebBot package and its configuration by including the WebBot bootstrap file. Next, we set the variable urls, which is an array of URLs that we want to fetch and that the bot will convert into WebBot Document objects. In this example, I am using www.google.com as the URL. This is for example purposes only, and you should use your own URLs. Use URLs that use HTML tags such as <title>*</title> and <h2>*</h2>.

Subsequently, we set the document_fields variable with an array of field IDs and the fields' regular expression patterns. In the preceding code, we are defining the document fields title and h2. The title field will include all values in the URL's data (or HTTP response body) that are in the <title>*</title> HTML tags. Likewise, the h2 field will include all values in the URL's data that are within the <h2>*</h2> HTML tags. Again, if you are not familiar with regular expression patterns, you should read more about the topic.

Note

Regular expressions are a very useful tool to have in your programmer's toolbox.

In the next line of code, we set the webbot variable with a WebBot object. We pass our urls and document_fields variables to the object constructor method. These are the only parameters required by our WebBot object, which makes it very simple to use and understand.

Following the instantiation of the WebBot object, we call the WebBot object's execute() method. This tells the WebBot object to start fetching the URL's data by making HTTP requests, and then build a document array of WebBot class Document objects.

In the next block of code, we test if the WebBot object has successfully executed the fetches by checking the success class property. If the success property is true, this doesn't necessarily mean that every URL fetch was executed successfully; it simply means the object was able to call an HTTP GET request for each URL.

In the next section, we loop through each document that we get from the WebBot method getDocuments() and test if the document (or the HTTP response body) was retrieved properly, using the Document class property called success. If the Document object is ready, or was retrieved properly, we display the document ID, the document URL, and print the document fields and field values. Obviously, in real-world applications we would do something more useful with this data, but for this example we can see the results our bot has generated. If the document wasn't retrieved properly (perhaps the HTTP request encountered a 404 Not Found status code), we display the document error, which is the HTTP status code and status code message.

Step 9 – the WebBot results

When we execute the project_directory/03_webbot.php file in a web browser, we see something like the following:

3 total documents 
3 total documents fetched successfully 
0 total documents failed to fetch 

Document: search
URL: http://www.google.com
Array
(
    [title] => Array
        (
            [0] => Google
        )

    [h2] => Array
        (
            [0] => Account Options
        )

)

Document: chrome
URL: http://www.google.com/intl/en/chrome/browser/
Array
(
    [title] => Array
        (
            [0] => 
      Chrome Browser
    
        )

    [h2] => Array
        (
            [0] => 
                Customize your browser
              
            [1] => 
                Get Chrome for Mobile
              
            [2] => 
                Up to 15 GB free storage
              
            [3] => 
            Get a fast, free web browser
          
        )

)

Document: products
URL: http://www.google.com/intl/en/about/products/
Array
(
    [title] => Array
        (
            [0] => Google  - Products
        )

    [h2] => Array
        (
            [0] => Web
            [1] => Mobile
            [2] => Media
            [3] => Geo
            [4] => Specialized Search
            [5] => Home & Office
            [6] => Social
            [7] => Innovation
        )

)

We can see from the results that our bot is operating as expected. First, we can see that we fetched a total of three documents, three documents were fetched successfully, and the bot failed to fetch zero documents. We can then see the document ID, URL, and fields (and field values) for each document. The bot has successfully parsed out each field and field value. This type of bot would be useful for harvesting various types of documents and document fields from desired websites or web services.
