Webbots and Newsgroups

Newsgroups are a rich source of content for webbot developers. While less convenient than websites, news servers are not hard to access, especially when you have a set of functions that do most of the work for you. All of this chapter's example scripts use the LIB_nntp library. Functions in this library provide easy access to articles on news servers and create many opportunities for webbots. LIB_nntp contains functions that list newsgroups hosted by specific news servers, list available articles within newsgroups, and download particular articles. As with all libraries used in this book, the latest version of LIB_nntp is available for download at the book's website.

Identifying News Servers

Before you use NNTP, you'll need to find an accessible news server. A Google search for free news servers will provide links to some, but keep in mind that not all news servers are equal. Since few news servers host all newsgroups, not every news server will have the group you're looking for. Many free news servers also limit the number of requests you can make in a day or suffer from poor performance. For these reasons, many people prefer to pay for access to reliable news servers. You might already have access to a premium news server through your ISP. Be warned, however, that some ISPs' news servers (like those hosted by RoadRunner and EarthLink) will not allow access if you are not directly connected to a subnet in their network.

Identifying Newsgroups

Your news bots should always verify that the group you want to access is hosted by your news server. The script in Listing 14-1 uses get_nntp_groups() to create an array containing all the newsgroups on a particular news server. (Remember to put the name of your news server in place of your.news.server below.) Putting the newsgroups in an array is handy, since it allows a webbot to examine groups iteratively.

include("LIB_nntp.php");
$server = "your.news.server";
$group_array= get_nntp_groups($server);
var_dump($group_array);

Listing 14-1: Requesting (and viewing) the newsgroups available on a news server

The result of executing Listing 14-1 is shown in Figure 14-2.

Figure 14-2. Newsgroups hosted on a news server

Notice that Figure 14-2 only shows the newsgroups that hadn't already scrolled off the screen. In this example, my news server returned 46,626 groups. (It also required 40 seconds to download them all, so expect a short delay when requesting large amounts of data.) For each group, the server responds with the name of the group, the identifier of the last article, the identifier of the first article, and a y if you can post articles to this group or an n if posting articles to this group (on this server) is prohibited.
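Because each group arrives as one line of space-separated fields, a webbot can split the lines and look up a group by name before requesting articles. The following is a minimal sketch, not part of LIB_nntp: the function names parse_group_line() and find_group() are hypothetical, the sample data is made up, and the field order (name, last article, first article, posting flag) is assumed to follow RFC 977.

```php
<?php
// Hypothetical helper (not part of LIB_nntp): split one LIST line into fields.
// Assumes the RFC 977 layout: "group last first flag".
function parse_group_line($line)
{
    if (!preg_match('/^(\S+)\s+(\d+)\s+(\d+)\s+(\S+)\s*$/', $line, $m)) {
        return null;                 // Status line, lone period, or blank line
    }
    return array(
        'NAME'  => $m[1],
        'LAST'  => (int) $m[2],      // Identifier of the last article
        'FIRST' => (int) $m[3],      // Identifier of the first article
        'POST'  => $m[4],            // y = posting allowed, n = prohibited
    );
}

// Hypothetical helper: find one group in an array like the one from Listing 14-1.
function find_group($group_array, $wanted)
{
    foreach ($group_array as $line) {
        $group = parse_group_line($line);
        if ($group !== null && $group['NAME'] === $wanted) {
            return $group;
        }
    }
    return null;                     // Group is not hosted on this server
}

// Made-up sample data in the shape described above
$sample = array(
    "215 list of newsgroups follows",
    "alt.vacation.las-vegas 0000562400 0000561000 y",
    "comp.lang.php 0000012345 0000012000 n",
    ".",
);
$found = find_group($sample, "alt.vacation.las-vegas");
echo $found['FIRST'] . " to " . $found['LAST'] . "\n";   // prints "561000 to 562400"
```

Returning null for a missing group gives the webbot a simple test to run before it asks the server for articles.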

News servers terminate messages by sending a line that contains just a period (.), which you can see in the last array element in Figure 14-2. That lone period is the only sign your webbot will receive to tell it to stop looking for data. If your webbot reads buffers incorrectly, it will either hang indefinitely or return with incomplete data. The small function shown in Listing 14-2 (found in LIB_nntp) correctly reads data from an open NNTP network socket and recognizes the end-of-message indicator.

function read_nntp_buffer($socket)
    {
    $this_line = "";
    $buffer = "";
    while($this_line != ".\r\n")          // Read until lone . found on line
        {
        $this_line = fgets($socket);      // Read line from socket
        if($this_line === false)          // Stop if the connection closes or times out
            break;
        $buffer = $buffer . $this_line;
        }
    return $buffer;
    }

Listing 14-2: Reading NNTP data and identifying the end of messages
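One way to experiment with end-of-message handling without a news server is to feed canned text through an in-memory stream. The sketch below is an illustration only: read_nntp_lines() is a hypothetical variant of the library function that returns an array of lines, stops at end of file as well as at the lone period, and removes the extra leading period that NNTP servers add to any data line that begins with a period.

```php
<?php
// Hypothetical variant of read_nntp_buffer(): returns the response as an array of lines
function read_nntp_lines($stream)
{
    $lines = array();
    while (($line = fgets($stream)) !== false) {    // false means EOF or error
        $line = rtrim($line, "\r\n");
        if ($line === ".") {
            break;                                   // Lone period: end of message
        }
        if (substr($line, 0, 1) === ".") {
            $line = substr($line, 1);                // Remove the doubled leading period
        }
        $lines[] = $line;
    }
    return $lines;
}

// Simulate a server response with an in-memory stream instead of a socket
$stream = fopen("php://memory", "w+");
fwrite($stream, "First line\r\n..hidden.file\r\nLast line\r\n.\r\nnot part of this message\r\n");
rewind($stream);
$lines = read_nntp_lines($stream);
fclose($stream);
print_r($lines);
```

The text after the lone period is never read, which is the behavior a webbot needs when it sends several commands over one connection.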

The script in Listing 14-1 uses the function get_nntp_groups() to get an array of available groups hosted by your news server. The script for that function is shown below in Listing 14-3.

function get_nntp_groups($server)
    {
    # Open socket connection to the news server
    $fp = fsockopen($server, $port=119, $errno, $errstr, 30);
    if (!$fp)
        {
        # If socket error, return the error (there is no socket to close)
        $groups_array['ERROR'] = "ERROR: $errstr ($errno)";
        return $groups_array;
        }

    # Else tell server to return a list of hosted newsgroups
    $out = "LIST\r\n";
    fputs($fp, $out);
    $groups = read_nntp_buffer($fp);
    $groups_array = explode("\r\n", $groups);     // Convert to an array

    fputs($fp, "QUIT\r\n");                       // Log out
    fclose($fp);                                  // Close socket
    return $groups_array;
    }

Listing 14-3: A function that finds available newsgroups on a news server

As you'll learn, all NNTP commands follow a structure similar to the one used in Listing 14-3. Most NNTP commands require that you do the following:

  1. Connect to the server (on port 119)

  2. Issue a command, like LIST (followed by a carriage return/line feed)

  3. Read the results (until encountering a line with a lone period)

  4. End the session with a QUIT command

  5. Close the network socket
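The five steps above can be sketched as one reusable pair of functions. This is a rough outline under stated assumptions, not the library's code: nntp_session() and nntp_command() are hypothetical names, and the sketch assumes the command is one (like LIST) whose successful reply is followed by data ending in a lone period.

```php
<?php
// Hypothetical helper (not part of LIB_nntp): steps 2 through 4 on an open socket
function nntp_session($socket, $command)
{
    $result = array('LINES' => array());
    $result['GREETING'] = trim((string) fgets($socket));   // Server speaks first
    fputs($socket, $command . "\r\n");                     // Step 2: issue command
    $result['STATUS'] = trim((string) fgets($socket));     // One-line status reply

    // Step 3: only a successful (2xx) reply is followed by data.
    // Assumes $command is one that returns multi-line data (like LIST).
    if (strncmp($result['STATUS'], "2", 1) === 0) {
        while (($line = fgets($socket)) !== false) {
            $line = rtrim($line, "\r\n");
            if ($line === ".") {
                break;                                     // Lone period ends the data
            }
            $result['LINES'][] = $line;
        }
    }
    fputs($socket, "QUIT\r\n");                            // Step 4: end the session
    return $result;
}

// Hypothetical wrapper: steps 1 and 5 around the session
function nntp_command($server, $command)
{
    $socket = fsockopen($server, 119, $errno, $errstr, 30);   // Step 1: connect
    if (!$socket) {
        return array('ERROR' => "ERROR: $errstr ($errno)");
    }
    $result = nntp_session($socket, $command);
    fclose($socket);                                           // Step 5: close socket
    return $result;
}

// Example call (needs a reachable news server):
// $reply = nntp_command("your.news.server", "LIST");
```

Separating the socket handling from the conversation makes it possible to exercise the conversation against a local socket pair, without touching a real server.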

Other NNTP commands that identify groups hosted by news servers are listed in RFC 977. You can use the basic structure of get_nntp_groups() as a guide to creating other functions that execute NNTP commands found in RFC 977.

Finding Articles in Newsgroups

As you read earlier, each news server that hosts a newsgroup stores its own copy of that group's articles. Each article has a sequential numeric identifier that identifies the article on a particular news server. You may request the range of numeric identifiers for articles (for a given newsgroup) with a script similar to the one in Listing 14-4.

include("LIB_nntp.php");
# Request article IDs
$server = "your.news.server";
$newsgroup = "alt.vacation.las-vegas";
$ids_array = get_nntp_article_ids($server, $newsgroup);

# Report Results
echo "\nInfo about articles in $newsgroup on $server\n";
echo "Code: ".                    $ids_array['RESPONSE_CODE']."\n";
echo "Estimated # of articles: ". $ids_array['EST_QTY_ARTICLES']."\n";
echo "First article ID: ".        $ids_array['FIRST_ARTICLE']."\n";
echo "Last article ID: ".         $ids_array['LAST_ARTICLE']."\n";

Listing 14-4: Requesting article IDs from a news server

The result of running the script in Listing 14-4 is shown in Figure 14-3.

Figure 14-3. Executing get_nntp_article_ids() and displaying the results

This function returns data in an array, with elements containing a status code,[46] the estimated quantity of articles for that group on the server, the identifier of the first article in the newsgroup, and the identifier of the last article in the newsgroup. An estimate of the number of articles is provided because some articles are deleted after submission, so not every article within the given range is actually available. It's also worth noting that each server will have its own rules for when articles become obsolete, so each server will have a different number of articles for any one newsgroup. The code that actually reads the article identifiers from the server is shown in Listing 14-5.

function get_nntp_article_ids($server, $newsgroup)
    {
    # Open socket connection to the news server
    $socket = fsockopen($server, $port=119, $errno, $errstr, 30);
    if (!$socket)
        {
        # If socket error, return the error (there is no socket to close)
        $return_array['ERROR'] = "ERROR: $errstr ($errno)";
        return $return_array;
        }

    # Read (and discard) the greeting the server sends on connection
    fgets($socket);

    # Tell server which group to connect to
    fputs($socket, "GROUP ".$newsgroup."\r\n");

    # The one-line reply holds the range of available articles for this group:
    # response code, estimated quantity, first ID, last ID, group name
    $res = trim((string) fgets($socket));
    $array = array_pad(explode(" ", $res), 4, "");

    $return_array['GROUP_MESSAGE']      = $res;
    $return_array['RESPONSE_CODE']      = $array[0];
    $return_array['EST_QTY_ARTICLES']   = $array[1];
    $return_array['FIRST_ARTICLE']      = $array[2];
    $return_array['LAST_ARTICLE']       = $array[3];

    fputs($socket, "QUIT\r\n");
    fclose($socket);
    return $return_array;
    }

Listing 14-5: The function get_nntp_article_ids()
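Since the first and last identifiers only bound the range, a webbot that wants recent articles might walk the range backward from the last identifier and accept that some requests will fail. The sketch below is one possible approach, assuming an array shaped like the one returned by get_nntp_article_ids(); the helper name newest_article_ids() and the sample values are made up for illustration.

```php
<?php
// Hypothetical helper: identifiers of the newest $limit articles, newest first.
// Some of these IDs may no longer exist on the server.
function newest_article_ids($ids_array, $limit)
{
    $first = (int) $ids_array['FIRST_ARTICLE'];
    $last  = (int) $ids_array['LAST_ARTICLE'];
    if ($limit < 1 || $last < $first) {
        return array();                      // Empty group or nothing requested
    }
    $start = max($first, $last - $limit + 1);
    return range($last, $start);             // Counts down from $last to $start
}

// Made-up sample values in the shape returned by get_nntp_article_ids()
$ids_array = array(
    'RESPONSE_CODE'    => '211',
    'EST_QTY_ARTICLES' => '1380',
    'FIRST_ARTICLE'    => '561000',
    'LAST_ARTICLE'     => '562400',
);
print_r(newest_article_ids($ids_array, 3));  // 562400, 562399, 562398
```

In this sample the range spans 1,401 identifiers while the estimate is 1,380, which illustrates why a webbot should treat each identifier as a candidate rather than a guarantee.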

Reading an Article from a Newsgroup

Once you know the range of valid article identifiers for your newsgroup (on your news server), you can request an individual article. For example, the script in Listing 14-6 reads article number 562340 from the group alt.vacation.las-vegas.

include("LIB_nntp.php");
$server = "your.news.server";
$newsgroup = "alt.vacation.las-vegas";
$article   = read_nntp_article($server, $newsgroup, $article=562340);
echo $article['HEAD'];
echo $article['ARTICLE'];

Listing 14-6: Reading and displaying an article from a news server

When you execute the code in Listing 14-6, you'll see a screen similar to the one in Figure 14-4. On my news server, article 562340 is the same article displayed in the screenshot of the Thunderbird news reader, shown earlier in Figure 14-1.[47]

Figure 14-4. Reading a newsgroup article

The first part of Figure 14-4 shows the NNTP header, which, like a mail or HTTP header, returns status information about the article. Following the header is the article. Notice that both the header and the beginning of the article also refer to the article by a longer identifier enclosed in angle brackets. Unlike the server-dependent identifier used in the previous function call, this longer identifier is universal and references this article on any news server that hosts this newsgroup.
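A webbot will usually want individual header fields, such as the universal identifier, rather than the raw header text. The following sketch shows one way to split header text into name and value pairs. It is an illustration only: parse_nntp_head() is a hypothetical function, and the sample header (including its addresses and identifier) is invented.

```php
<?php
// Hypothetical helper: turn raw header text into lowercase-name => value pairs
function parse_nntp_head($head)
{
    $headers = array();
    $name = null;
    foreach (preg_split('/\r\n|\n/', $head) as $line) {
        if ($name !== null && preg_match('/^[ \t]+\S/', $line)) {
            $headers[$name] .= " " . trim($line);        // Continuation of previous field
        } elseif (preg_match('/^([A-Za-z0-9-]+):\s*(.*)$/', $line, $m)) {
            $name = strtolower($m[1]);
            $headers[$name] = $m[2];
        }
        // Anything else (status line, lone period, blank line) is ignored
    }
    return $headers;
}

// Invented sample in the general shape of a HEAD response
$head = "221 562340 <example-id@news.example> head\r\n"
      . "From: someone@example.com\r\n"
      . "Subject: Best buffet\r\n"
      . "  on the Strip?\r\n"
      . "Message-ID: <example-id@news.example>\r\n"
      . ".\r\n";
$headers = parse_nntp_head($head);
echo $headers['message-id'] . "\n";   // prints "<example-id@news.example>"
echo $headers['subject'] . "\n";      // prints "Best buffet on the Strip?"
```

Lowercasing the field names means the webbot does not depend on how a particular server capitalizes them.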

The function called to read the news article is shown in Listing 14-7.

function read_nntp_article($server, $newsgroup, $article)
    {
    # Open socket connection to the news server
    $socket = fsockopen($server, $port=119, $errno, $errstr, 30);

    if (!$socket)
        {
        # If socket error, return the error (there is no socket to close)
        $return_array['ERROR'] = "ERROR: $errstr ($errno)";
        return $return_array;
        }

    # Else tell server which group to connect to
    fputs($socket, "GROUP ".$newsgroup."\r\n");

    # Request this article's HEAD
    # (The server's greeting and GROUP reply have not been read yet,
    # so they precede the header text in the returned data)
    fputs($socket, "HEAD $article\r\n");
    $return_array['HEAD'] = read_nntp_buffer($socket);

    # Request the article
    fputs($socket, "BODY $article\r\n");
    $return_array['ARTICLE'] = read_nntp_buffer($socket);

    fputs($socket, "QUIT\r\n");         // Sign out (newsgroup server)
    fclose($socket);                    // Close socket
    return $return_array;               // Return data array
    }

Listing 14-7: A function that reads a newsgroup article

As mentioned earlier, NNTP was designed for use on older (slower) networks. For this reason, the article headers are available separately from the actual articles. This allowed news readers to download article headers first, to show users which articles were available on their news servers. If an article interested the viewer, that article alone was downloaded, consuming minimum bandwidth.



[46] There is a full list of NNTP status codes in Appendix B.

[47] Remember that article IDs are unique to newsgroups on each specific news server. Your article IDs are apt to be different.
