Chapter 14. Middleware and XML

CGI programming has been used to make individual web applications from simple guestbooks to complex programs such as a calendar capable of managing the schedules of large groups. Traditionally, these programs have been limited to displaying data and receiving input directly from users.

However, as with all popular technologies, CGI is being pushed beyond these traditional uses. Going beyond CGI applications that interact with users, the focus of this chapter is on how CGI can be a powerful means of communicating with other programs.

We have seen how CGI programs can act as a gateway to a variety of resources such as databases, email, and a host of other protocols and programs. However, a CGI program can also perform some sophisticated processing on the data it gets so that it effectively becomes a data resource itself. This is the definition of CGI middleware. In this context, the CGI application sits between the program it is serving data to and the resources that it is interacting with.

The variety of search engines that exist provides a good example of why CGI middleware can be useful. In the early history of the Web, there were only a few search engines to choose from. Now, there are many. The results these engines produce are usually not identical. Finding out about a rare topic is not an easy task if you have to jump from engine to engine to retry the search.

Instead of trying multiple queries, you would probably rather issue one query and get back results from many search engines in a consolidated form with duplicate responses already filtered out. To make this a reality, the search engines themselves must become CGI middleware engines, talking to one CGI script that consolidates the results.

Furthermore, a CGI middleware layer can consolidate databases other than ones on the Internet. For example, a company-wide directory service could search several internal phone directory databases, such as customer data and human resources data, and fall back on an Internet phone resource such as http://www.four11.com/ if the information is lacking internally, as shown in Figure 14-1.

Figure 14-1. Consolidated phone directory interface using CGI middleware

This chapter demonstrates two technologies that illustrate the use of CGI middleware. First, we look at how to make network connections from your CGI scripts in order to talk to other servers. Then, we introduce eXtensible Markup Language (XML), a platform-independent way of transferring data between programs, and show an example using Perl’s XML parser.

Communicating with Other Servers

Let’s look at the typical communication scheme between a client and a server. Consider an electronic mail application, for example. Most email applications save the user’s messages in a particular file, typically in the /var/spool/mail directory. When you send mail to someone on a different host, the mail application must find the recipient’s mail file on that server and append your message to it. How does the mail program achieve this task, since it cannot manipulate files on a remote host directly?

The answer to this question is interprocess communication (IPC). Typically, there exists a process on the remote host, which acts as a messenger for dealing with email services. When you send a message, the local process on your host communicates with this remote agent across a network to deliver mail. As a result, the remote process is called a server (because it services an issued request), and the local process is referred to as a client. The Web works along the same philosophy: the browser represents the client that issues a request to an HTTP server that interprets and executes the request.

The most important thing to remember here is that the client and the server must speak the same language. In other words, a particular client is designed to work with a specific server. So, for example, an email client, such as Eudora, cannot communicate with a web server. But if you know the stream of data expected by a server, and the output it produces, you can write an application that communicates with the server, as you will see later in this chapter.

Sockets

Most companies have a telephone switchboard that acts as a gateway for calls coming in and going out. A socket can be likened to a telephone switchboard. If you want to connect to a remote host, you need to first create a socket through which the communications would occur. This is similar to dialing “9” to go through the company switchboard to the outside world.

Similarly, if you want to create a server that accepts connections from remote (or local) hosts, you need to set up a socket that listens for connections. The socket is identified on the Internet by the host’s IP address and the port that it listens on. Once a connection is established, a new socket is created to handle this connection, so that the original socket can go back and listen for more connections. The telephone switchboard works in the same manner: as it handles outside phone calls, it routes them to the appropriate extension and goes back to accept more calls.

For the sake of discussion, think of a socket simply as a pipe between two locations. You can send and receive information through that pipe. This concept will make it easier for you to understand socket I/O.
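The listening socket, the per-connection socket, and the pipe analogy can be seen together in a small sketch. It runs both ends of the connection in one process over the loopback address, which is an assumption made only to keep the example self-contained; a real client and server would normally be separate programs. The helper name round_trip and the choice of port 0 (let the operating system pick a free port) are ours, not part of IO::Socket.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::INET;

# Send one line through a loopback connection and return the server's reply.
sub round_trip {
    my $message = shift;

    # The listening socket. LocalPort 0 lets the operating system pick a
    # free port, an assumption that keeps the sketch runnable anywhere.
    my $listener = IO::Socket::INET->new( LocalAddr => '127.0.0.1',
                                          LocalPort => 0,
                                          Proto     => 'tcp',
                                          Listen    => 5 )
        or die "Cannot listen: $!\n";

    # The client end of the pipe, connected to the listener's address and port.
    my $client = IO::Socket::INET->new( PeerAddr => '127.0.0.1',
                                        PeerPort => $listener->sockport,
                                        Proto    => 'tcp' )
        or die "Cannot connect: $!\n";

    # accept returns a new socket for this one connection; $listener itself
    # stays free to accept further connections.
    my $connection = $listener->accept or die "Cannot accept: $!\n";

    $_->autoflush( 1 ) foreach $client, $connection;

    print {$client} "$message\n";                # client writes into the pipe
    chomp( my $request = <$connection> );        # server reads the line
    print {$connection} uc( $request ), "\n";    # server answers in uppercase
    chomp( my $reply = <$client> );              # client reads the answer

    close $_ foreach $client, $connection, $listener;
    return $reply;
}

print round_trip( "hello" ), "\n";   # prints HELLO
```

Note that the server reads from and writes to $connection, the socket returned by accept, never to $listener itself.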

IO::Socket

The IO::Socket module, which is included with the standard Perl distribution, makes socket programming simple. Example 14.1 provides a short program that takes a URL from the user, requests the resource via a GET method, then prints the headers and content.

Example 14-1. socket_get.pl

#!/usr/bin/perl -wT

use strict;

use IO::Socket;
use URI;

my $location = shift || die "Usage: $0 URL\n";

my $url      = new URI( $location );
my $host     = $url->host;
my $port     = $url->port || 80;
my $path     = $url->path || "/";

my $socket   = new IO::Socket::INET (PeerAddr => $host,
                                     PeerPort => $port,
                                     Proto    => 'tcp')
               or die "Cannot connect to the server.\n";

$socket->autoflush (1);

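# Ask the server to close the connection when it is done; otherwise the
# read loop below would wait on a persistent HTTP/1.1 connection.
print $socket "GET $path HTTP/1.1\n",
              "Host: $host\n",
              "Connection: close\n\n";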
print while (<$socket>);

$socket->close;

We use the URI module, discussed in Chapter 2, to break the URL supplied by the user into components. Then we create a new IO::Socket::INET object and pass it the host, the port number, and the communications protocol. The module takes care of the rest of the details.

We make the socket unbuffered by using the autoflush method. Notice in the next set of code that we can use the instance variable $socket as a file handle as well. This means that we can read from and write to the socket through this variable.

This is a relatively simple program, but there is an even easier way to retrieve web resources from Perl: LWP.

LWP

LWP, which stands for libwww-perl, is an implementation of the W3C’s libwww package for Perl by Gisle Aas and Martijn Koster, with contributions from a host of others. LWP allows you to create a fully configurable web client in Perl. You can see an example of some of what LWP can do in Section 8.2.5.

With LWP, we can write our web agent as shown in Example 14.2.

Example 14-2. lwp_full_get.pl

#!/usr/bin/perl -wT

use strict;
use LWP::UserAgent;
use HTTP::Request;

my $location = shift || die "Usage: $0 URL\n";

my $agent = new LWP::UserAgent;
my $req = new HTTP::Request GET => $location;
$req->header('Accept' => 'text/html');

my $result = $agent->request( $req );

print $result->headers_as_string,
      $result->content;

Here we create a user agent object as well as an HTTP request object. We ask the user agent to fetch the result of the HTTP request and then print out the headers and content of this response.
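The response object returned by the user agent also records whether the request succeeded, which the example above does not check. The following sketch shows one way to test for that before printing. To keep it runnable without a network connection, it builds HTTP::Response objects by hand as stand-ins for what $agent->request would return; report_response is a hypothetical helper name, not part of LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Response;

# Summarize a response: headers and content on success, the status line
# otherwise. report_response is a hypothetical helper, not part of LWP.
sub report_response {
    my $result = shift;
    return "Request failed: " . $result->status_line . "\n"
        unless $result->is_success;
    return $result->headers_as_string . "\n" . $result->content;
}

# Responses built by hand stand in for what $agent->request would return.
my $ok      = HTTP::Response->new( 200, "OK",
                                   [ "Content-Type" => "text/html" ],
                                   "<HTML></HTML>\n" );
my $missing = HTTP::Response->new( 404, "Not Found" );

print report_response( $ok ), report_response( $missing );
```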

Finally, let’s look at LWP::Simple. LWP::Simple does not offer the same flexibility as the full LWP module, but it is much easier to use. In fact, we can rewrite our previous example to be even shorter; see Example 14.3.

Example 14-3. lwp_simple_get.pl

#!/usr/bin/perl -wT

use strict;
use LWP::Simple;

my $location = shift || die "Usage: $0 URL\n";

getprint( $location );

There is a slight difference between this and the previous example. It does not print the HTTP headers, just the content. If we want to access the headers, we would need to use the full LWP module instead.

An Introduction to XML

XML is useful because it provides an industry-standard way of describing data. In addition, XML accomplishes this feat in a style similar to HTML, which thousands of developers are already familiar with. CGI programs that speak XML will be able to deliver data to and retrieve data from any XML-compliant Perl script or Java applet.

It is possible to use CGI as middleware without a data description language such as XML. The success of libraries such as LWP for Perl demonstrates this. However, most web pages still deliver data as plain HTML. Using LWP to grab these pages and the HTML::Parser to parse them leaves much to be desired. Although HTML has to be produced in order for a web browser to consume the data even when XML is used, the HTML itself is likely to change depending on how the web designer wants the page to look, even if the data described in XML would still remain the same. For this reason, writing a parser for an HTML document can be problematic because the HTML parser will break as soon as the structure of how the data is displayed is changed.

On the client side of the coin, those projects requiring the sophisticated data-display capabilities of Java need to have some way of obtaining their data. Enabling Java applets to talk to CGI programs provides a lightweight and easy way to gather the data for presentation.

For the most part, HTML has served its purpose well. Web browsers have successfully dealt with HTML markup tags to display content to users for years. However, while human readers can absorb the data in the context of their own language, machines find it difficult to interpret the ambiguity of data written in a natural language such as English inside an HTML document. This problem brought about the recognition that what the Web needs is a language that could mark up content in a way that is easily machine-readable.

XML was designed to make up for many of HTML’s limitations in this area. The following is a list of features XML provides that makes it useful as a mechanism for transporting data from program to program:

  1. New tags and tag hierarchies can be defined to represent data specific to your application. For instance, a quiz can contain <QUESTION> and <ANSWER> tags.

  2. Document type definitions can be defined for data validation. You can require, for instance, that every <QUESTION> be associated with exactly one <ANSWER>.

  3. Data transport is Unicode-compliant, which is important for non-ASCII character sets.

  4. Data is provided in a way that makes it easily transportable via HTTP.

  5. Syntax is simple, allowing parsers to be simple.

As an example, let’s look at a sample XML document that might contain the data for an online quiz. At the most superficial level, a quiz has to be represented as a collection of questions and their answers. The XML looks like this:

<?xml version="1.0"?>
<!DOCTYPE QUIZ SYSTEM "quiz.dtd">
<QUIZ>
  <QUESTION TYPE="Multiple">
    <ASK>
      All of the following players won the regular season MVP and playoff
      MVP in the same year, except for:
    </ASK>
    <CHOICE VALUE="A" TEXT="Larry Bird"/>
    <CHOICE VALUE="B" TEXT="Jerry West"/>
    <CHOICE VALUE="C" TEXT="Earvin Magic Johnson"/>
    <CHOICE VALUE="D" TEXT="Hakeem Olajuwon"/>
    <CHOICE VALUE="E" TEXT="Michael Jordan"/>
    
    <ANSWER>B</ANSWER>
    <RESPONSE VALUE="B">
      West was awesome, but they did not have a playoff 
      MVP in his day.
    </RESPONSE>
    <RESPONSE STATUS="WRONG">
      How could you choose Bird, Magic, Michael, or Hakeem?
    </RESPONSE>
  </QUESTION>
  
  <QUESTION TYPE="Text">
    <ASK>
      Who is the only NBA player to get a triple-double by halftime?
    </ASK>
    
    <ANSWER>Larry Bird</ANSWER>
     <RESPONSE VALUE="Larry Bird">
       You got it! He was quite awesome!
     </RESPONSE>
     <RESPONSE VALUE="Magic Johnson">
       Sorry. Magic was just as awesome as Larry, but he never got a
       triple-double by halftime.
     </RESPONSE>
     <RESPONSE STATUS="WRONG">
       I guess you are not a Celtics Fan.
     </RESPONSE>
  </QUESTION>
</QUIZ>

You can see from the above document that XML is actually very simple, and it is very similar to HTML. This is no accident. One of XML’s primary design goals is to make it compatible with the Internet. The other major goal is to make the language so simple that it is relatively trivial to write an XML parser.

From the structure in the sample XML document, you can ascertain that the root data structure is a quiz surrounded by <QUIZ> tags. All XML documents must present the data with at least one root structure surrounding the whole document.

Within the quiz structure shown here, there are two questions. Within those questions are descriptions of the question itself, an answer to the question, and a host of possible responses.

Obviously, this input has to be accompanied by a style sheet or some other guide to the browser, so that the browser knows basic things like not displaying the answers with the questions. Later in this chapter, we will write a Perl program to translate an XML document into standard HTML.

The question tags are written with an open and closing tag to illustrate that multiple datasets (ask, answer, response) are placed between them. On the other hand, we made the choices for a multiple-choice question into single, empty tags. XML makes this clear by forcing a “/” at the end of the single tag definition.

This is one of the main areas where XML differs from HTML. HTML would just leave the single empty tag as is. However, the designers of XML felt that a parser is easier to write if it can tell, as soon as it sees a tag end with “/>” instead of “>”, that it does not have to look for a matching closing tag.

The above XML document is arbitrarily structured. We could have presented the information in different ways.

For example, we could have made the <CHOICE> tag open instead of empty so that a choice could hold more definitions inside itself. An open tag would allow, say, a round-robin list of possible choices, so that the choices do not appear the same all the time. This is an important XML point: XML was designed to accommodate any data structure.

Document Type Definition

A document type definition (DTD) tells us how the XML document is structured and what the tags mean in relation to one another. Notice that the second line in the quiz XML example contains a document type declaration, indicated by a <!DOCTYPE> tag. This tag references a file that contains the DTD for this XML structure. Generally, the <!DOCTYPE> tag is used when an XML parser wants to validate the XML against a stricter definition.

For example, the XML shown above could easily be parsed without the DTD. However, the DTD may offer additional hints to the XML parser to further validate the document. Here’s a sample quiz.dtd file:

<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT QUIZ (QUESTION*)>
<!ELEMENT QUESTION (ASK+,CHOICE*,ANSWER+,RESPONSE+)>
<!ATTLIST QUESTION
  TYPE CDATA #REQUIRED>

<!ELEMENT ASK (#PCDATA)>
<!ELEMENT CHOICE EMPTY>
    <!ATTLIST CHOICE
         VALUE CDATA #REQUIRED
         TEXT CDATA #REQUIRED>
<!ELEMENT ANSWER (#PCDATA)>
<!ELEMENT RESPONSE (#PCDATA)>
    <!ATTLIST RESPONSE
         VALUE CDATA #IMPLIED
         STATUS CDATA #IMPLIED>

The <!ELEMENT> tags describe the actual tags that are valid in the XML document. In this case, <QUIZ>, <QUESTION>, <ASK>, <CHOICE>, <ANSWER>, and <RESPONSE> tags are available for use in an XML document compliant with the quiz.dtd file.

The parentheses after the name of the element show what tags it can contain. The * symbol is a quantity identifier. It follows the same basic rules as regular expression matching: a * indicates that zero or more of that element may be contained. To indicate zero or one, we would have placed a ? in place of the *. Likewise, to indicate that one or more of that element must be contained inside the tag, we would have used +. #PCDATA indicates that the element contains character data.

For this example, the <QUIZ> tag expects to contain zero or more QUESTION elements, while the <QUESTION> tag expects to contain at least one each of the ASK, ANSWER, and RESPONSE elements. Questions can also have zero or more choices. Furthermore, the CHOICE element definition later in the DTD uses the EMPTY keyword to indicate that it is a single tag that appears by itself; it does not enclose anything. The ASK, ANSWER, and RESPONSE elements contain character data only.
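Because the quantity symbols behave like their regular expression counterparts, the QUESTION content model can be mimicked with an ordinary Perl pattern. The sketch below is only an illustration of that correspondence, not a validating parser: children_match is a hypothetical helper that takes the names of a question's child elements in document order and reports whether the sequence fits (ASK+,CHOICE*,ANSWER+,RESPONSE+).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The QUESTION content model (ASK+,CHOICE*,ANSWER+,RESPONSE+) written as a
# regular expression over a space-terminated list of child element names.
my $QUESTION_MODEL = qr/^(?:ASK )+(?:CHOICE )*(?:ANSWER )+(?:RESPONSE )+$/;

# Return 1 if the given child elements, in order, satisfy the model and
# 0 otherwise. children_match is a hypothetical helper.
sub children_match {
    my @children = @_;
    my $sequence = join "", map { "$_ " } @children;
    return $sequence =~ $QUESTION_MODEL ? 1 : 0;
}

# All six children in the expected order: prints 1
print children_match( qw( ASK CHOICE CHOICE ANSWER RESPONSE RESPONSE ) ), "\n";

# CHOICE may be omitted because of the *: prints 1
print children_match( qw( ASK ANSWER RESPONSE ) ), "\n";

# ANSWER is missing, and + requires at least one: prints 0
print children_match( qw( ASK CHOICE RESPONSE ) ), "\n";
```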

After each element is defined, its attributes need to be laid out. Questions have a type attribute that takes a string of character data. Furthermore, the #REQUIRED keyword indicates that this data is required in the XML document. The other attribute definitions follow a similar pattern in the quiz.dtd file.

The DTD file is optional. You can still parse an XML document without a document type definition. However, with the DTD, the XML parser is provided with rules that the data validation should be based on. Maintaining these validation rules centrally allows the XML format to change without having to make as many changes to the parser code. Parsers that do not use a DTD are called non-validating XML parsers; the standard Perl module for parsing XML documents, XML::Parser, is a non-validating XML parser.

Presumably, anybody writing a quiz will use an editor that checks their XML against the DTD, or will run the document through a validating program. Thus, our program will never encounter a question that does not contain an answer, or some other violation of the DTD.

When a program knows the structure of an XML document using a DTD, it can make other assumptions on how to display that data. For example, a browser could be programmed so that when a quiz document is encountered, it will display the available questions in a list even if only one question was present in the document itself. Because the DTD tells us that it is possible for many questions to appear in the file, the browser can determine the context in which to display the data in the XML document.

The ability to decouple validation rules from the parser is especially important on the Web. With the potential for many people to write code that draws information from an XML data source, any type of mechanism that prevents changes in the XML definition from breaking those parsers will make for a more robust network.

Writing an XML Parser

The XML parser example builds on the work of the XML::Parser library available on CPAN. XML::Parser is an interface to a library written in C called expat by James Clark. Originally Larry Wall wrote the first XML::Parser library prototype for Perl. Since then, Clark Cooper has continued to develop and maintain XML::Parser. In this section, we will write a simple middleware application using XML.

The latest versions of Netscape have a feature called “What’s Related”. When the user clicks on the What’s Related button, the Netscape browser takes the URL that the user is currently viewing and looks up related URLs in a search engine. Most users don’t know that the Netscape browser is actually doing this through an XML-based search engine. Dave Winer originally wrote an article with accompanying Frontier code to access the What’s Related search engine at http://nirvana.userland.com/whatsRelated/.

Netscape maintains a server that takes URLs and returns the related URL information in an XML format. Netscape wisely chose XML because they did not intend for users to interact directly with this server using HTML forms. Instead, they expected users to choose “What’s Related” as a menu item and then have the Netscape browser do the XML parsing.

In other words, the Netscape “What’s Related” web server is actually serving as a middleware layer between the search engine database and the Netscape browser itself. We will write a CGI frontend to the Netscape application that serves up this XML to demonstrate the XML parser. In addition, we will also go one step further and automatically reissue the “What’s Related” query for each URL returned.

Before we jump into the Perl code, we need to take a look at the XML that is typically returned from the Netscape server. In this example, we did a search on What’s Related to http://www.eff.org/, the web site that houses the Electronic Frontier Foundation. Here is the returned XML:

<RDF:RDF>
<RelatedLinks>
<aboutPage href="http://www.eff.org:80/"/>
<child href="http://www.privacy.org/ipc" name="Internet Privacy Coalition"/>
<child href="http://epic.org/" name="Electronic Privacy Information Center"/>
<child href="http://www.ciec.org/" name="Citizens Internet Empowerment Coalition"/>
<child href="http://www.cdt.org/" name="The Center for Democracy and Technology"/>
<child href="http://www.freedomforum.org/" name="FREE! The Freedom Forum Online. News about free press"/>
<child href="http://www.vtw.org/speech" name="VTW Focus on Internet Censorship legislation"/>
<child href="http://www.privacyrights.org/" name="Privacy Rights Clearinghouse"/>
<child href="http://www.privacy.org/pi" name="Privacy International Home Page"/>
<child href="http://www.epic.org/" name="Electronic Privacy Information Center"/>
<child href="http://www.anonymizer.com/" name="Anonymizer, Inc."/>
</RelatedLinks>
</RDF:RDF>

This example is a little different from our plain XML example earlier. First, there is no DTD. Also, notice that the document is surrounded with an unusual tag, RDF:RDF. This document is actually in an XML-based format called Resource Description Framework, or RDF. RDF describes resource data, such as the data from search engines, in a way that is standard across data domains.

This XML is relatively straightforward. The <aboutPage> tag contains a reference to the original URL we were searching. Each <child> tag contains a reference to one related URL and its title. The <RelatedLinks> tag encloses all of these entries and is itself nested inside the <RDF:RDF> root element.

CGI Gateway to XML Middleware

The following CGI script will act as a gateway parsing the XML from the Netscape What’s Related server. Given a URL, it will print out all the related URLs. In addition, it will also query the Netscape What’s Related server for all the URLs related to this list of URLs and display them. From this point onward, we will refer to URLs that are related to the first set of related URLs as second-level related URLs. Figure 14.2 shows the initial query screen while Figure 14.3 illustrates the results from a sample query. Example 14.4 shows the HTML for the initial form.

Figure 14-2. Search form for the “What’s Related” CGI script

Figure 14-3. “What’s Related to What’s Related” results from querying http://www.eff.org/

Example 14-4. whats_related.html

<HTML>
<HEAD>
    <TITLE>What's Related To What's Related Query</TITLE>
</HEAD>
<BODY BGCOLOR="#ffffff">
    <H1>Enter URL To Search:</H1>
    <HR>
    <FORM METHOD="POST">
        <INPUT TYPE="text" NAME="url" SIZE=30><P>
        <INPUT TYPE="submit" NAME="submit_query" VALUE="Submit Query">
    </FORM>
</BODY>
</HTML>

Two Perl modules will be used to provide the core data connection and translation services to the search engine. First, the library for web programming (LWP) module will be used to grab data from the search engine. Since the What’s Related server can respond to GET requests, we use the LWP::Simple subset of LWP rather than the full-blown API. Then, XML::Parser will take the retrieved data and process it so that we can manipulate the XML using Perl data structures. The code is shown in Example 14.5.

Example 14-5. whats_related.cgi

#!/usr/bin/perl -wT

use strict;
use constant WHATS_RELATED_URL => "http://www-rl.netscape.com/wtgn?";
use vars qw( @RECORDS $RELATED_RECORDS );

use CGI;
use CGI::Carp qw( fatalsToBrowser );
use XML::Parser;
use LWP::Simple;

my $q = new CGI(  );

if ( $q->param( "url" ) ) {
    display_whats_related_to_whats_related( $q );
} else {
    print $q->redirect( "/whats_related.html" );
}


sub display_whats_related_to_whats_related {
    my $q = shift;
    my $url = $q->param( "url" );
    my $scriptname = $q->script_name;
    
    print $q->header( "text/html" ),
          $q->start_html( "What's Related To What's Related Query" ),
          $q->h1( "What's Related To What's Related Query" ),
          $q->hr,
          $q->start_ul;
    
    my @related = get_whats_related_to_whats_related( $url );
    
    foreach ( @related ) {
        print $q->a( { -href => "$scriptname?url=$_->[0]" }, "[*]" ),
              $q->a( { -href => "$_->[0]" }, $_->[1] );
        
        my @subrelated = @{$_->[2]};
        
        if ( @subrelated ) {
            print $q->start_ul;
            foreach ( @subrelated ) {
                print $q->a( { -href => "$scriptname?url=$_->[0]" }, "[*]" ),
                      $q->a( { -href => "$_->[0]" }, $_->[1] );
            }
            print $q->end_ul;
        } else {
            print $q->p( "No Related Items Were Found" );
        }
    }
    
    if ( ! @related ) {
        print $q->p( "No Related Items Were Found. Sorry." );
    } 
    
    print $q->end_ul,
          $q->p( "[*] = Go to What's Related To That URL." ),
          $q->hr,
          $q->start_form( -method => "GET" ),
            $q->p( "Enter Another URL To Search:",
              $q->text_field( -name => "url", -size => 30 ),
              $q->submit( -name => "submit_query", -value => "Submit Query" )
            ),
          $q->end_form,
          $q->end_html;
}


sub get_whats_related_to_whats_related {
    my $url = shift;

    my @related = get_whats_related( $url ); 
    my $record;
    foreach $record ( @related ) {
        $record->[2] = [ get_whats_related( $record->[0] ) ];
    }
    return @related;
}


sub get_whats_related {
    my $url = shift;
    my $parser = new XML::Parser( Handlers => { Start => \&handle_start } );
    my $data = get( WHATS_RELATED_URL . $url );
    
    $data =~ s/&/&amp;/g;
    while ( $data =~ s|(="[^"]*)"([^/ ])|$1'$2|g ) { };
    while ( $data =~ s|(="[^"]*)<[^"]*>|$1|g ) { };
    while ( $data =~ s|(="[^"]*)<|$1|g ) { };
    while ( $data =~ s|(="[^"]*)>|$1|g ) { };
    $data =~ s/[\x80-\xFF]//g;
    
    local @RECORDS = (  );
    local $RELATED_RECORDS = 1;
    
    $parser->parse( $data );
    
    sub handle_start {
        my $expat = shift;
        my $element = shift;
        my %attributes = @_;

        if ( $element eq "child" ) {
            my $href = $attributes{"href"};
            $href =~ s/http.*http(.*)/http$1/;

            if ( $attributes{"name"} &&
                 $attributes{"name"} !~ /smart browsing/i &&
                 $RELATED_RECORDS ) {
                if ( $attributes{"name"} =~ /no related/i ) {
                    $RELATED_RECORDS = 0;
                } else {
                    my $fields = [ $href, $attributes{"name"} ];
                    push @RECORDS, $fields;
                }
            }
        }
    }
    return @RECORDS;
}

This script starts like most of our others, except we declare @RECORDS and $RELATED_RECORDS as global variables that will be used to temporarily store information while parsing the XML document. In particular, @RECORDS will contain the URLs and titles of the related URLs, and $RELATED_RECORDS is a flag that stays set unless Netscape’s What’s Related server reports that there are no related documents. WHATS_RELATED_URL is a constant that contains the URL of Netscape’s What’s Related server.

In addition to the CGI.pm module, we use CGI::Carp with the fatalsToBrowser option in order to make any errors echo to the browser for easier debugging. This is important because XML::Parser dies when it encounters a parsing error. XML::Parser is the heart of the program. It will perform the data extraction of the related items. LWP::Simple is a simplified subset of LWP, a library of functions for grabbing data from a URL.

We create a CGI object and then check whether we received a url parameter. If so, then we process the query; otherwise, we simply forward the user to the HTML form. To process our query, a subroutine is called to display “What’s Related to What’s Related” to the URL (display_whats_related_to_whats_related).

The display_whats_related_to_whats_related subroutine contains the code that displays the HTML of a list of URLs that are related to the submitted URL including the second-level related URLs.

We declare a lexical variable called @related. This data structure contains all the related URL information after the data gets returned from the get_whats_related_to_whats_related subroutine.

More specifically, @related contains references to the related URLs, which in turn contain references to second-level related URLs. @related contains references to arrays whose elements are the URL itself, the title of the URL, plus another array pointing to second-level related URLs. The subarray of second-level related URLs contains only two elements: the URL and the title. Figure 14.4 illustrates this data structure.

Figure 14-4. Perl data structure that contains the related URLs and subsequent related URLs
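Written out as Perl literals, the structure described above might look like the following sketch. The URLs and titles are borrowed from the EFF sample, but which URL is nested under which is made up purely for illustration, and outline is a hypothetical helper that walks the structure the same way the display code does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each top-level record is [ url, title, [ second-level records ] ];
# each second-level record is just [ url, title ]. The nesting shown
# here is illustrative, not real output from the What's Related server.
my @related = (
    [ "http://epic.org/", "Electronic Privacy Information Center",
      [ [ "http://www.privacy.org/pi", "Privacy International Home Page" ],
        [ "http://www.cdt.org/", "The Center for Democracy and Technology" ] ] ],
    [ "http://www.anonymizer.com/", "Anonymizer, Inc.", [] ],
);

# Flatten the structure into text lines, indenting second-level entries.
sub outline {
    my @lines;
    foreach my $record ( @_ ) {
        my ( $url, $title, $subrelated ) = @$record;
        push @lines, "$title <$url>";
        push @lines, "    $_->[1] <$_->[0]>" foreach @$subrelated;
    }
    return @lines;
}

print "$_\n" foreach outline( @related );
```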

If there are no related items found at the top level submitted URL, a message is printed to notify the user.

Later, we want to print out self-referencing hypertext links back to this script. In preparation for this action, we create a variable called $scriptname that will hold the current scriptname for referencing in <A HREF> tags. CGI.pm’s script_name method provides a convenient way of getting this data.

Of course, we could have simply chosen a static name for this script. However, it is generally considered good practice to code for flexibility where possible. In this case, we can name the script anything we want and the code here will not have to change.

For each related URL, we print out “[*]” embedded in an <A> tag that will contain a reference to the script itself plus the current URL being passed to it as a search parameter. If one element of @related contains ["http://www.eff.org/", "The Electronic Frontier Foundation"] the resulting HTML would look like this:

<A HREF="whats_related.cgi?url=http://www.eff.org/">[*]</A>
<A HREF="http://www.eff.org/">The Electronic Frontier Foundation</A>

This will let the user pursue the “What’s Related” trail another step by running this script on the chosen URL. Immediately afterwards, the title ($_->[1]) is printed with a hypertext reference to the URL that the title represents ($_->[0]).
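One caveat, offered as a suggestion rather than as part of the script: the related URL is placed in the query string as is, so a URL that itself contains characters such as ? or & could be misread as several parameters. A minimal sketch of escaping the value first follows. Both escape_param and self_link are hypothetical helpers, and the script path used in the example call is an assumption; the uri_escape function in the URI::Escape module offers similar escaping.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percent-encode every byte that is not an unreserved URL character.
# escape_param is a hypothetical helper name.
sub escape_param {
    my $value = shift;
    $value =~ s/([^A-Za-z0-9\-_.~])/sprintf "%%%02X", ord $1/ge;
    return $value;
}

# Build the self-referencing link target for one related URL.
sub self_link {
    my ( $scriptname, $url ) = @_;
    return "$scriptname?url=" . escape_param( $url );
}

# The script path below is assumed for the sake of the example.
print self_link( "/cgi/whats_related.cgi", "http://www.eff.org/?a=1&b=2" ), "\n";
# prints /cgi/whats_related.cgi?url=http%3A%2F%2Fwww.eff.org%2F%3Fa%3D1%26b%3D2
```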

@subrelated contains the URLs that are related to the URL we just printed for the user ($_->[2]). If there are second-level related URLs, we can proceed to print them. The second-level related URL array follows the same format as the related URL array except that there is no third element containing further references to more related URLs. $_->[0] is the URL and $_->[1] is the title of the URL itself. If @subrelated is empty, the user is told that there are no related items to the URL that is currently being displayed.

Finally, we output the footer for the What’s Related query results page. In addition, the user is presented with another text field in which they can enter in a new URL to search on.

The get_whats_related_to_whats_related subroutine contains logic to take a URL and construct a data structure that contains not only URLs that are related to the passed URL, but also the second-level related URLs. @related contains the list of what’s related to the first URL.

Then, each record is examined in @related to see if there is anything related to that URL. If there is, the third element ($record->[2]) of the record is set to a reference to the second-level related URLs we are currently examining. Finally, the entire @related data structure is returned.
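The nesting described here can be sketched as follows. To keep the sketch self-contained, get_whats_related is replaced by a stand-in that reads from a hard-coded table (%FAKE_RELATED and the example.* URLs are invented for illustration). As a simplification, the sketch always stores a reference in the third element, even when the second-level list is empty.

```perl
use strict;
use warnings;

# Canned stand-in for the real get_whats_related, which fetches and parses XML
my %FAKE_RELATED = (
    "http://a.example/" => [ [ "http://b.example/", "Site B" ],
                             [ "http://c.example/", "Site C" ] ],
    "http://b.example/" => [ [ "http://d.example/", "Site D" ] ],
);

sub get_whats_related {
    my ($url) = @_;
    # Copy each record so callers can extend it without altering the table
    return map { [ @$_ ] } @{ $FAKE_RELATED{$url} || [] };
}

sub get_whats_related_to_whats_related {
    my ($url) = @_;
    my @related = get_whats_related($url);

    foreach my $record (@related) {
        my @subrelated = get_whats_related($record->[0]);
        # Third element: reference to the second-level list (possibly empty)
        $record->[2] = \@subrelated;
    }
    return @related;
}

my @related = get_whats_related_to_whats_related("http://a.example/");
foreach my $record (@related) {
    print "$record->[1] ($record->[0])\n";
    print "    $_->[1] ($_->[0])\n" for @{ $record->[2] };
}
```

Each top-level record thus has the shape [URL, title, reference to second-level records], while each second-level record has only the URL and the title.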

The get_whats_related subroutine returns an array of references, each pointing to a two-element array: a related URL and the title of that URL. The key to getting this information is to parse it from an XML document. $parser is the XML::Parser object that will be used to perform this task.

XML parsers do not simply parse data in a linear fashion. After all, XML itself is hierarchical in nature. There are two different ways that XML parsers can look at XML data.

One way is to have the XML parser take the entire document and simply return a tree of objects that represents the XML document hierarchy. Perl supports this concept via the XML::Grove module by Ken MacLeod. The second way to parse XML documents is to use a SAX (Simple API for XML) style of parser. This type of parser is event-based, and it is the model that XML::Parser follows.

The event-based parser is popular because it starts returning data to the calling program as it parses the document. There is no need to wait until the whole document is parsed before getting a picture of how the XML elements are placed in the document. XML::Parser accepts a file handle or the text of an XML document and then goes through its structure looking for certain events. When a particular event is encountered, the parser calls the appropriate Perl subroutine to handle it on the fly.

For this program, we define a handler that looks for the start of any XML tag. This handler is declared as a reference to a subroutine called handle_start. The handle_start subroutine is declared further below within the local context of the subroutine we are discussing.

XML::Parser can handle more than just start tags. XML::Parser also supports the capability of writing handlers for other types of parsing events such as end tags, or even for specific tag names. However, in this program, we only need to declare a handler that will be triggered any time an XML start tag is encountered.
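A minimal sketch of registering a start-tag handler with XML::Parser is shown below. The sample document is made up for illustration and is not actual output from the What’s Related server, and the @start_tags array exists only so the sketch has something to print.

```perl
use strict;
use warnings;
use XML::Parser;

my @start_tags;    # element names, in document order

# Called for every start tag; the arguments after the element name
# are attribute name/value pairs
sub handle_start {
    my ($expat, $element, %attributes) = @_;
    push @start_tags, $element;
}

my $parser = XML::Parser->new(Handlers => { Start => \&handle_start });

# Made-up sample document, not actual What's Related output
$parser->parse('<doc><child href="http://www.eff.org/" name="EFF"/></doc>');

print join(", ", @start_tags), "\n";
```

Other events are registered the same way, by adding further keys such as End or Char to the Handlers hash.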

$data contains the raw XML code to be parsed. The get subroutine was previously imported by pulling the LWP::Simple module into the Perl script. When we pass WHATS_RELATED_URL along with the URL we are looking for to the get subroutine, get will go out on the Internet and retrieve the output from the “What’s Related” web server.

You will notice that as soon as $data is collected, there is some additional manipulation done to it. XML::Parser will parse only well-formed XML documents. Unfortunately, the Netscape XML server sometimes returns data that is not entirely well-formed, so a generic XML parser has a little difficulty with it.

To get around this problem, we filter out potentially bad data inside of the tags. The regular expressions in the above code respectively transform ampersands, double-quotes, HTML tags, and stray < and > characters into well-formed counterparts. The last regular expression deals with filtering out non-ASCII characters.
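To give a flavor of this kind of cleanup, here is a sketch of two such substitutions. The clean_xml routine and its patterns are illustrative assumptions, not the regular expressions used in the program itself, and they cover only the ampersand and non-ASCII cases.

```perl
use strict;
use warnings;

# Hypothetical cleanup routine; these patterns are illustrative and are
# not the ones used in the program being discussed.
sub clean_xml {
    my ($data) = @_;

    # Escape ampersands that do not already start an entity reference
    $data =~ s/&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)/&amp;/g;

    # Drop characters outside the 7-bit ASCII range
    $data =~ s/[^\x00-\x7F]//g;

    return $data;
}

print clean_xml(qq{<child name="Tom & Jerry caf\xE9" href="http://x.example/?a=1&amp;b=2"/>}), "\n";
```

The negative lookahead leaves existing entity references such as &amp; untouched, so running the filter over data that is already well-formed should not double-escape it.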

Before parsing the data, we reset the global variables: @RECORDS is set to an empty list and $RELATED_RECORDS is set to true (1).

Simply calling the parse method on the $parser object starts the parsing process. The $data variable that is passed to parse holds the XML text to be parsed. The parse method also accepts other types of input, including file handles to XML files.

Recall that the handle_start subroutine was passed to the $parser object upon its creation. The handle_start subroutine that is declared within the get_whats_related subroutine is called by XML::Parser every time a start tag is encountered.

$expat is a reference to the underlying XML::Parser::Expat object that is performing the parse. $element is the start element name and %attributes is a hash table of attributes that were declared inside the XML element.

For this example, we are concerned only with tags that begin with the name “child” and contain the href attribute. In addition, the $href value is filtered so any non-URL information is stripped out of the URL.

If there is no name attribute, or if the name attribute contains the phrase “Smart Browsing”, or if there were no related records found previously for this URL, we do not want to add anything to the @RECORDS array. In addition, if the name attribute contains the phrase “no related”, the $RELATED_RECORDS flag is set to false (0).

Otherwise, we add the URL to the @RECORDS array. This is done by making a reference to an array with two elements: the URL and the title of the URL. At the end of the subroutine, the compiled @RECORDS array is returned.
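The filtering rules just described can be sketched as a standalone handler. This is an approximation under stated assumptions: the exact tests and their order in the real program may differ, the stripping of non-URL information from $href is omitted, and the handler is driven here by hand-written calls (with undef standing in for the parser object) rather than by XML::Parser.

```perl
use strict;
use warnings;

our @RECORDS         = ();
our $RELATED_RECORDS = 1;

sub handle_start {
    my ($expat, $element, %attributes) = @_;

    # Only "child" tags that carry an href attribute are of interest
    return unless $element =~ /^child/ && defined $attributes{href};

    my $name = $attributes{name};

    # A "no related" message switches off collection for this URL
    if (defined $name && $name =~ /no related/i) {
        $RELATED_RECORDS = 0;
    }

    return if !defined $name
           || $name =~ /Smart Browsing/
           || !$RELATED_RECORDS;

    # Store the record as [URL, title]
    push @RECORDS, [ $attributes{href}, $name ];
}

# Simulated parser events; the example.* URLs are invented
handle_start(undef, "child", href => "http://www.eff.org/", name => "The Electronic Frontier Foundation");
handle_start(undef, "child", href => "http://x.example/", name => "Smart Browsing info");
handle_start(undef, "child", href => "http://y.example/");
handle_start(undef, "other", href => "http://z.example/", name => "Ignored");

print "$_->[1]: $_->[0]\n" for @RECORDS;
```

Of the four simulated events, only the first passes every test, so @RECORDS ends up holding a single [URL, title] pair.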

This program was a simple example of using a CGI program to pull data automatically from an XML-based server. While the What’s Related server is just one XML server, it is conceivable that as XML grows, there will be more database engines on the Internet that deliver even more types of data. Because XML is a standard language for marking up data on the Web, extensions to this CGI script can be used to access those new data repositories.

More information about XML, DTD, RDF, and even the Perl XML::Parser library can be found at http://www.xml.com/. Of course, XML::Parser can also be found on CPAN.
