Instrumenting a Docbase for Collaborative Review

To transform an XML repository into an HTML docbase with an NNTP discussion component, you need to do the following three things.

Insert link targets into the docbase

Links in the discussion area point back to these targets, as do links in the web-based table of contents.

Insert comment links into the docbase

These links invoke the comment form, or rather the script that generates that form. The links encode the information that the script needs to produce an NNTP message that will bind to the right spot in the newsgroup and that will point back to the right spot in the docbase.

Create the initial discussion framework

The docbase’s headers (h1..h6) define the desired structure. To populate the newsgroup accordingly, you generate a set of NNTP messages whose Message-ID: and References: headers correspond to that structure, then load those messages using one of several techniques.

Let’s consider each of these three steps in more detail.

Inserting Link Targets into the Docbase

We want to translate <p> or <li> into <a name="252"><p> or <a name="1124"><li> so that comments posted regarding these elements can point back to the right spot in the text.

Although the final solution I’ll present uses Perl’s XML::Parser, the examples in Example 9.3 and Example 9.4 use two other parsers, one driven by Java and one by JavaScript. Why? There’s more than one way to do it, and that can come in handy when you’re stuck. For example, when I started working with XML, I’d rather have used Perl, but the XML::Parser module wasn’t quite ready at the time. No matter. At the end of the day, a component is just a component. What matters is getting the job done, not which programming language you use. There isn’t One True Language for the successful developer of Internet groupware. This book includes examples of Perl, Java, JavaScript, Visual Basic, SQL, and C. If I had eschewed XML because I couldn’t (at the time) write parser-based Perl scripts, I would have been cutting off my nose to spite my face. Value resides in components, not in programming languages.

There are macrocomponents—the clients and servers that make up the mail/news/Web trio—and there are microcomponents that can bind the macrocomponents into useful new configurations. Keep an open mind and a well-stocked toolkit. Microcomponents such as XML parsers and NNTP interface modules come in many varieties. When you need one that doesn’t happen to come in your favorite flavor, try a different flavor. If you’re a Perl programmer, but the component you need happens to come in only the Python or Java flavor, it may be quicker to learn the little bit of Python or Java you’ll need to use that component than to reinvent it in Perl. That’s particularly true in web environments where, as we’ve seen, parts can easily combine. In the case of my first reviewable-docbase builder, for example, links inserted into the generated docbase by a Java program invoked CGI scripts written in Perl.

Example 9.3 shows a Java-based solution to the problem of instrumenting a docbase with link targets. It uses the DataChannel/Microsoft XJ Parser.

Example 9-3. Inserting Link Targets Using the DataChannel/Microsoft XJ Parser

import java.util.*;
import java.io.*;
import java.net.*;                        
import com.datachannel.xml.om.*;

public class parseXML
{

static int element = 0;

public final static void main(String argv[])
    {
    String myURL = "book.xml";

    boolean caseSensitive = false;
    boolean validating = true;
    boolean preserveWhiteSpaces = false;
    Document doc = new Document();
    try
        {
        doc.load(myURL);
        traverse( (IXMLDOMNode) doc.getDocumentElement());
        }
    catch (Exception e)
        {
        e.printStackTrace();        
        }
    }

public static void traverse (IXMLDOMNode node)
    {
    XMLDOMNamedNodeMap attrMap = (XMLDOMNamedNodeMap) node.getAttributes();

    XMLDOMNodeList childList = (XMLDOMNodeList)node.getChildNodes();
    
    if ( node.getNodeType() == node.ELEMENT_NODE )  
        {
        if ( node.getNodeName().equals("p") ||
             node.getNodeName().equals("li") )
            {
            System.out.print( "<a name=\"" + element++ + "\">");
            }
        System.out.print( "<" + node.getNodeName() );
        IXMLDOMNode attr = attrMap.nextNode();
        while ( attr != null )
            {
            System.out.print ( " " + attr.getNodeName() + "=\"" +
                               attr.getNodeValue() + "\"");
            attr = attrMap.nextNode();
            }
        System.out.println(">");
        }
    else if ( ( node.getNodeType() == node.TEXT_NODE ) )
        {
        System.out.println(node.getNodeValue());
        }
    else if ( ( node.getNodeType() == node.ENTITY_NODE ) )
        {
        System.out.print ( node.getNodeValue() ); 
        }
    else 
        {
        System.out.println ( "\nnode: " + node.getNodeType()) ;
        }

    IXMLDOMNode child = childList.nextNode() ;
    while ( child != null )
        {
        traverse(child);
        child = childList.nextNode();
        }

    if ( node.getNodeType() == node.ELEMENT_NODE )  // close the element
        {
        System.out.println("</" + node.getNodeName() + ">");
        }
    }

}

This Java program begins by reading the whole XML document into an in-memory tree. Then it traverses that tree, emitting element tags, attributes, and contents. It applies just one transformation to the XML source, prepending link targets to the elements that are the reviewable chunks of the docbase. In this example, these are paragraphs and list items. The code emits the XML tags themselves and all the attributes that come with each tag. Why? Remember that we’re depending on this XML to be HTML/CSS as well. This book, for example, uses CSS-enhanced tags like <h1 class="chapter"> and <p class="figure-title">. The transformed docbase has to preserve the tags with their attributes so that a browser can render the output as HTML, governed by CSS styles.

Let’s look at another way to do it. Example 9.4 inserts link targets using JavaScript to drive the MSXML parser. And in this example, the script is embedded in a web page.

Example 9-4. Inserting Element Anchors Using MSXML in an ASP Script

<%@ language = "jscript"%>

<%
var element = 1;

var doc = Server.CreateObject("microsoft.xmldom");

doc.load("c:\\web\\book.xml");

if (doc.parseError != "") 
    {
    Response.write(
        doc.parseError.reason + "," + 
        doc.parseError.line + "," + 
        doc.parseError.linepos + "," + 
        doc.parseError.srcText);        
    }

traverse(doc.documentElement);

function traverse(node)
    {
    if (node.nodeTypeString == "element") 
        {
        doStartTag(node);
        if (node.childNodes.length != null)
            {
            var i;
            for (i = 0; i < node.childNodes.length; i++)
                {
                traverse(node.childNodes.item(i));
                }
            }
        doEndTag(node);
        }
    else if (node.nodeTypeString == "text") 
        {
        Response.write(node.nodeValue);
        }
    else if (node.nodeTypeString == "entity") 
        {
        Response.write(node.nodeValue);
        }
    else
        Response.write ("node: " + node.nodeType);
    }

function doStartTag(node)
    {
    if ( 
        (node.nodeName == 'p')     || 
        (node.nodeName == 'li')    
        )
        {
        Response.write( '<a name="' + element++ + '">\n');
        }
    Response.write("<" + node.nodeName);
    doAttrs(node);
    Response.write(">\n");
    }

function doEndTag(node)
    {
    Response.write("</" + node.nodeName + ">");
    }
    

function doAttrs(node)
    {
    if  ( node.attributes.length > 0 )
        {
        var i;
        for (i = 0; i < node.attributes.length; i++)
            {
            // quote each value so attributes containing spaces survive
            Response.write( " " + node.attributes.item(i).nodeName + "=\"" +
                            node.attributes.item(i).nodeValue + "\"");
            }
        }
    }

%>

Because this script runs in the Active Server Pages environment, it can do XML-to-HTML conversion on the fly. This is useful, but on-the-fly conversion can be slow for a large document, so the technique I actually used for this book generates HTML pages that are served statically or just read into a browser using the file:// protocol.

Inserting Comment Links into the Docbase

Comment links are the numbered links at the end of each paragraph and list element, as shown in Figure 9.2. The text of each link is the same sequence number encoded in the link targets we just made. The address lurking behind those few digits, though, includes all sorts of instrumentation:

The NNTP message ID associated with the element’s controlling header

The fourth paragraph under an <h2> header, for example, will encode that header’s message ID so that a comment posted by way of that paragraph’s comment link will nest under the NNTP message that represents that header.

The URL for the element

For this book, I processed the whole set of chapters as a single XML stream. But since it would be inconvenient to view the book as a single HTML document, I carved the HTML output into per-chapter chunks. So if paragraph 253 occurs in Chapter 7, its URL—for Version 2 of the draft—would be /groupware/v2/chap7.htm#253.

The complete text of the element

The comment form quotes this text so that reviewers can refer to it as they compose their comments. That form’s handler, which constructs and posts the NNTP message that is the comment, uses a leading fragment of the text as the Subject: header of the message.

When you click on the comment link, these items enable a CGI script to generate a form that quotes the section heading and paragraph from the book, collects comments about it, and posts a message containing all this information to the reviewers’ newsgroup. As we’ll see shortly, there’s an alternate implementation in which clicking the link launches a mail message that works in a similar way.
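As a rough sketch of how those three items might be packed into a link address, the following Perl fragment URL-escapes the element text and assembles a query string. The parameter names follow the comment.pl call that appears later in Example 9.10. The escape_text( ) and comment_link( ) routines, and the host and path values, are assumptions made for illustration, not the code used for the book.

```perl
use strict;
use warnings;

# Percent-encode everything except unreserved URL characters.
# (Hypothetical helper; assumes single-byte characters.)
sub escape_text {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.~-])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# Build the address behind a comment link (hypothetical routine).
sub comment_link {
    my (%arg) = @_;
    return "http://$arg{server}/$arg{cgi_path}/comment.pl"
         . "?docbase="  . escape_text($arg{docbase})
         . "&chapnum="  . $arg{chapter}
         . "&elt="      . $arg{element}
         . "&fragment=" . escape_text($arg{text})
         . "&id="       . escape_text($arg{header_id});
}

my $url = comment_link(
    server    => 'localhost',          # assumed host
    cgi_path  => 'cgi-bin',            # assumed script directory
    docbase   => 'groupware.v2',
    chapter   => 7,
    element   => 253,
    text      => 'Keep an open mind & a well-stocked toolkit.',
    header_id => '<925327035_158@local>',
);

# The visible link text is just the element number.
print qq{<a href="$url">253</a>\n};
```

Escaping matters most for the quoted text and the message ID, since both can contain characters (spaces, ampersands, angle brackets) that would otherwise break the query string.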

Creating the Discussion Framework

Controlling NNTP message and reference IDs is the key to this step. Newsreaders don’t transmit message IDs when they post. It’s normally the server’s job to create those IDs. It assigns a unique ID, such as [email protected]. But if you create a message that includes a Message-ID: header, the news server will honor that ID so long as it doesn’t conflict with any existing messages. Since you can’t use a newsreader to transmit such a message, how do you send it? We saw in Chapter 5 how to use telnet to drive an NNTP server “by hand.” There are several ways to automate the posting of a news message. Standard INN and most derived implementations—including Netscape’s Collabra Server, but not Microsoft’s NNTP Service—come with a command-line tool called inews. Given a file called msg.txt containing a set of NNTP headers and a message body, you can post a message like this:

inews -h msg.txt

A hybrid Web/NNTP application might use a CGI Perl script to pipe the data to an instance of inews, as shown in this Perl fragment:

open (INEWS, " | inews -h") or die "cannot open pipe to inews $!";
print INEWS $msg;

If you lack the inews tool, you can use one of a number of NNTP client modules. These are available for Perl, Python, Java, and doubtless many other languages that can use TCP/IP sockets. For Perl programmers, the hardest part is deciding which module to use. There are at least three available on the Comprehensive Perl Archive Network (CPAN, http://www.cpan.org/): Net::NNTP, LWP (which is nominally a web client but which also handles NNTP), and News::NNTPClient. Example 9.5 shows how to post a message using Net::NNTP.

Example 9-5. Posting a Message Using Net::NNTP

use Net::NNTP;

my $nntp = Net::NNTP->new('localhost')
    or die "cannot connect to news server";

# Headers first, then a blank line, then the body
my @msg = (
"Newsgroups: groupware.v3\n",
"Subject: (What's more, you can join components...\n",
'From: [email protected]' . "\n",
"Message-ID: <925327035_159\@local>\n",
"References: <925327035_158\@local>\n",
"\n",
"I almost wonder if you need somewhere to develop a metaphor\n",
"analogous to \"the pipeline.\" Maybe go reread the wonderful...\n",
);

$nntp->post(@msg) or die "cannot post";
$nntp->quit;

Newsgroup hierarchy arises from References: headers. This header, which is optional, can contain one or more message IDs. Newsreaders use this information to create hierarchical views of newsgroups. In our example, we want each message representing an <h1> docbase tag to omit the References: header. These chapter names will form the top level of the tree. Messages corresponding to all other docbase <hn> tags should carry a References: header that is the message ID of the closest ancestral (that is: <hn-1>) tag. A series of <h2> tags, for example, should all refer back to the nearest preceding <h1>; an <h3> following one of those <h2> tags should refer back to that <h2>. If the message ID of that <h2> is <925327035_158@local>, then the message shown in Example 9.5 will become a reply to it.
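The parent lookup described here can be sketched in a few lines of Perl. The outline data and variable names below are made up for illustration, and the sketch assumes that header levels never skip (an h3 always follows an h2).

```perl
use strict;
use warnings;

# A made-up outline: [header level, message ID] in document order.
my @outline = (
    [ 1, '<925327035_157@local>' ],
    [ 2, '<925327035_158@local>' ],
    [ 3, '<925327035_159@local>' ],
    [ 2, '<925327035_160@local>' ],
);

my %last_id;      # most recent message ID seen at each header level
my @threaded;     # [message ID, References: value or ''] pairs

foreach my $hdr (@outline) {
    my ($level, $id) = @$hdr;
    # An <h1> starts a thread; any other level replies to the
    # nearest preceding header one level up.
    my $ref = $level == 1 ? '' : ($last_id{ $level - 1 } || '');
    $last_id{$level} = $id;
    push @threaded, [ $id, $ref ];
}

foreach my $msg (@threaded) {
    my ($id, $ref) = @$msg;
    print "Message-ID: $id\n";
    print "References: $ref\n" if $ref;
    print "\n";
}
```

Keeping only the latest ID per level is enough because the document is processed in order: by the time an h3 arrives, the most recently recorded h2 is its parent.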

How should we form the message IDs? It’s a good idea to incorporate a timestamp so that this batch of autogenerated messages won’t conflict with any others. Since it only takes a second to generate the batch, the timestamp alone won’t guarantee uniqueness. So we’ll tack a sequence number onto the end of each ID. That yields IDs like the ones shown in Example 9.5.
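A minimal sketch of that scheme follows. The next_msg_id( ) routine is a hypothetical name; the "local" host part simply mirrors the IDs in Example 9.5.

```perl
use strict;
use warnings;

my $timestamp = time();   # seconds since the epoch, taken once per run
my $sequence  = 0;        # distinguishes IDs minted within the same second

# Return the next message ID for this run (hypothetical routine).
sub next_msg_id {
    $sequence++;
    return "<${timestamp}_${sequence}\@local>";
}

my @ids = map { next_msg_id() } 1 .. 3;
print "$_\n" for @ids;
```

Taking the timestamp once, rather than on every call, keeps all IDs from one run visibly related and leaves uniqueness to the counter.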

Generating a Reviewable Docbase Using Perl and XML::Parser

I started with the Java DXP parser but switched immediately to Perl’s XML::Parser when it became available. You can use XML::Parser in a variety of modes, or “styles.” For example, the Tree style builds a complete in-memory representation of parsed XML content, which your script can then navigate and transform. The Stream style, which I’ll demonstrate here, doesn’t build an in-memory tree. Instead, it calls handlers, registered by your script, for three events—recognition of the beginning of a tag, of a tag’s content, or of the end of a tag. Here’s the skeleton of an XML::Parser script that uses the Stream style:

#! perl -w

use strict;
use XML::Parser;

my $xml = new XML::Parser (Style => 'Stream');

$xml->parsefile("book.xml");

sub StartTag {}

sub Text {}

sub EndTag {}

The work of transforming this book’s XML source into a reviewable docbase is divided among the three handlers, StartTag( ), Text( ), and EndTag( ). Let’s walk through these one at a time.
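Before doing so, it may help to see the three handlers cooperate on a tiny document. The following stripped-down sketch numbers paragraphs and list items and echoes everything else unchanged. It assumes XML::Parser is installed; the sample XML and the $count and $output variables are invented for the illustration, and the real handlers do considerably more.

```perl
use strict;
use warnings;
use XML::Parser;

my $count  = 0;     # running element number
my $output = '';    # transformed document accumulates here

# Stream-style handlers live in the calling package. In each one,
# $_ holds the text of the tag or character data just recognized.
sub StartTag {
    my ($expat, $element) = @_;
    if ($element eq 'p' or $element eq 'li') {
        $count++;
        $output .= qq{<a name="$count">};   # prepend a link target
    }
    $output .= $_;                          # then echo the tag itself
}

sub Text   { $output .= $_ }                # echo character data
sub EndTag { $output .= $_ }                # echo the end tag

my $parser = XML::Parser->new(Style => 'Stream');
$parser->parse('<doc><p class="x">one</p><li>two</li></doc>');

print "$output\n";
```

Collecting the result in a string rather than printing from inside the handlers makes the sketch easy to inspect; the book's handlers write to file handles instead.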

The StartTag( ) handler

The parser calls StartTag( ) (see Example 9.6) when it recognizes a tag, passing the tag name explicitly, and a hash representation of the attributes in Perl’s default hash, %_. What’s that? It was news to me too. I was familiar with $_, Perl’s default scalar, which magically stores the current line in a file-reading loop, or the current list element in a foreach loop. And I knew about @_, the default list that holds subroutine arguments. But I never suspected there might also be a default hash. Live and learn! In the Stream style, $_ also has a job: it holds a copy of the tag itself, which is what the handler echoes to the output.

Example 9-6. The StartTag Handler

sub StartTag
  {
  my ($expat,$element) = @_;

  if (withinCommentableElement($expat,$element) )
    {
    print DOCBASE $_; 
    return;
    }

  $comment_chars = "";

  if ( isPreformattedElement ($element) )   # work around broken CSS in MSIE
    {   print DOCBASE "\n<pre>"; }

  if  ( $element  eq 'h1'     )             # new chapter
    {
    $counters->{chapter}++;                 # update counters
    $counters->{figure}  = 0;
    $counters->{listing} = 0;
    $counters->{table}   = 0;
                                            # start new HTML output file
    open (DOCBASE, ">./docbase/chap$counters->{chapter}.htm") 
        or die "cannot open chap$counters->{chapter}.htm: $!";

    print DOCBASE <<EOT;                    # emit boilerplate
<head>
<link rel="stylesheet" type="text/css" href="chap-style.css">
</head>
<body>
EOT
    }

  $tocListTags = '';

  if ( my $hdr = isHeader ($element) )      # do table-of-contents outline
    {
    $newTocLev = $hdr;
    $lastHdrElt = $element;
    $tocPreamble = "<a name=\"$counters->{element}\">\n"
                 . "<a href=\"chap$counters->{chapter}.htm#$counters->{element}\"\n"
                 . " target=\"chap\">\n";
    if ($newTocLev > $tocStack[-1])
      {
      $tocListTags .= "<ul>\n";
      push (@tocStack, $newTocLev);
      }
    else
      {
      while ($tocStack[-1] > $newTocLev )
        {
        $oldTocLev = pop @tocStack;
        $tocListTags .= "</ul>\n";
        }
      }
    }

  if ( isCommentableElement ($element) )    # emit tag with jump target
    {
    print DOCBASE "\n<a name=$counters->{element}>$_\n";
    }
  else                                      # emit plain tag
    { 
    print DOCBASE "$_"; 
    }
  }

StartTag( ) begins by calling withinCommentableElement( ), a routine that tests whether the current element is contained within any of those to which comment links can attach.

sub withinCommentableElement
  {
  my ($expat,$element) = @_;
  my $within = 0;
  foreach my $elt ('p','li','h1','h2','h3','h4','h5')
    {
    if ( $expat->within_element($elt) )
      {
      $within = 1;
      }
    }
  return $within;
  }

Why do we need this routine? We want to accumulate complete paragraphs, list items, or headings for the quote that will be included in each comment link. Suppose a paragraph contains a <span>...</span>. We don’t want the StartTag( ) invocation that handles that tag to clear $comment_chars, the variable that’s accumulating the paragraph that contains this element. So if withinCommentableElement( ) succeeds, StartTag( ) just echoes the tag and returns.

When a new chapter appears in the stream, StartTag( ) increments the chapter counter, resets the figure and listing counters, and begins a new output file for that chapter’s generated HTML.

When a header appears, StartTag( ) records the HTML list syntax (<ul> tags) for the table of contents so that headers will indent properly. It also records link targets for these table-of-contents entries, so the links wrapped around the corresponding headers in the generated web page can jump to the right spot in the table of contents.

Finally, it writes the header tag to the generated web page. To headers, paragraphs, and list items—those elements that participate in the commenting system—it prepends link targets. The headers in the generated web pages, and the references in newsgroup messages, point to these targets in the docbase.
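The indentation bookkeeping for the table of contents can be seen in isolation in the following sketch. It mirrors the stack logic of Example 9.6 but runs over a made-up outline, starts the stack with a sentinel level of 0 (an assumption, since the initialization is not shown), and assumes that header levels never skip.

```perl
use strict;
use warnings;

my @stack = (0);    # header levels currently open; 0 is a sentinel
my $toc   = '';

# Made-up outline: [header level, title] in document order.
my @headers = (
    [ 1, 'Chapter' ], [ 2, 'Section' ], [ 3, 'Subsection' ],
    [ 2, 'Another section' ],
);

foreach my $hdr (@headers) {
    my ($level, $title) = @$hdr;
    if ($level > $stack[-1]) {            # deeper: open a nested list
        $toc .= "<ul>\n";
        push @stack, $level;
    }
    else {                                # same or shallower: close lists
        while ($stack[-1] > $level) {
            pop @stack;
            $toc .= "</ul>\n";
        }
    }
    $toc .= "<li>$title</li>\n";
}

# Close whatever is still open at the end of the outline.
while ($stack[-1] > 0) {
    pop @stack;
    $toc .= "</ul>\n";
}

print $toc;
```

The stack records which list levels are open, so a header that climbs back up the outline can close exactly the lists opened since its own level.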

The Text( ) handler

The parser sends all the characters it finds between matched pairs of start and end tags to the Text( ) routine, shown in Example 9.7.

Example 9-7. The Text( ) Handler

sub Text
  {
  my ($expat) = @_;
  my $chars = $_;

  $comment_chars .= $chars;                       # save text for use by EndTag
                                                       
  if ($expat->current_element() eq 'h1')          # if new chapter
    {  
    $chars = "Chapter $counters->{chapter}: " . $chars; # announce its number
    }

  if ( my $level = isHeader ($expat->current_element) ) # if header
    {                                                   
    my ($prev) = $level-1;                        # compute parent level
    $prev = "h" . $prev;                          # form parent h tag
    $msg_id++;                                    # update msg_id counter
    my $s_msg_id = $timestamp . "_" . $msg_id;    # form message id
    $current_header = $chars;                     # remember current header's text
    $lastHdrs{$expat->current_element} = $s_msg_id;     # remember governing ID
    $lastHdrId = $s_msg_id;                       # remember last ID
    my $s_ref_id = "";
    if ($expat->current_element ne 'h1')          # if not an h1
      {   $s_ref_id = $lastHdrs{$prev} }          # make a References: header
    make_nntp_msg ( $s_msg_id, $s_ref_id,         # add an entry to nntp load file
      $counters->{chapter},$current_header);
    }

  if ( my $type = isFigureOrListingOrTable ($expat->current_element) )
    {
    $current_figttl = $chars;
    $counters->{$type}++;
    print DOCBASE "$type $counters->{chapter}-$counters->{$type}: ";
    }

  if ( isHeader ($expat->current_element))        # if header
    { 
    my $elt = $counters->{element};
    my $cnum = ($expat->current_element() eq 'h1')
             ?  "$counters->{chapter}: "
             : '';
    print TOC                                     # write table-of-contents entry
      "$tocListTags $tocPreamble <li>  \n"
      . "<span class=\"lev$newTocLev\">$cnum $_</span></li></a>\n";

    print DOCBASE                                 # write HTML doc fragment
      "<a href=\"toc.htm#$elt\" target=\"toc\">$chars</a>"; 
    }
  else
    { print DOCBASE $chars; }

  }

Note that this routine also runs in what you might think of as the interstitial spaces of the XML stream. For example:

<p>some text</p>  <- Text receives "some text"
                  <- Text receives two newlines
<p>more text</p>  <- Text receives "more text"

Why does the parser report the newlines in this apparent no-man’s land? Because there’s always an enclosing scope. There has to be an outermost tag pair—which could be <html>..</html> or <GroupwareBook>..</GroupwareBook>—enclosing the whole stream. So there really is no interstitial space.

Note how characters accumulate in $comment_chars, the variable that will ultimately produce the quoted version of each element that appears on the comment form. Like StartTag( ), Text( ) may be called within an <li> or <p> tag pair. This happens, for example, when the parser sees an inline element such as <span> or <strong>. So the Text( ) routine uses $comment_chars to accumulate characters across multiple calls. StartTag( ), as we’ve seen, resets $comment_chars to the empty string.

When Text( ) encounters an HTML header—that is, an element in the set h1..h6—it builds an entry in a file of NNTP messages that will be used to populate the newsgroup. It forms a message ID from the timestamp taken at the beginning of the run, plus a message counter. The characters received from the parser—that is, the contents of the header—go into the variable $current_header. It will later be used by the EndTag( ) routine to complete the table-of-contents entry for this element. The Text( ) routine also passes $current_header to the make_nntp_msg( ) routine for use as the Subject: header of the NNTP message. make_nntp_msg( ) also receives the message ID created for this element. And for headers other than the top-level <h1>, it receives another message ID for use in the References: header. The Text( ) routine finds this ID in the %lastHdrs hashtable, which it also maintains. In the case of an <h3> header, for example, it looks up $lastHdrs{h2} to find the ID of the <h3>’s parent.

The Text( ) routine could post NNTP messages as it creates them, but instead it just builds a file that looks like Example 9.8:

Example 9-8. An NNTP Load File to Create the Discussion Framework

From: [email protected]
Message-ID: <925224566_151@local>
Subject: Chapter 8: Docbase Search
Newsgroups: groupware.v3
Content-type: text/html

Refer to <a href="http://localhost/.//chap8.htm#969">docbase</a>

From: [email protected]
Message-ID: <925224566_152@local>
Subject: A docbase's Web API
Newsgroups: groupware.v3
References: <925224566_151@local>
Content-type: text/html

Refer to <a href="http://localhost/.//chap8.htm#973">docbase</a>

From: [email protected]
Message-ID: <925224566_153@local>
Subject: URL namespace reengineering
Newsgroups: groupware.v3
References: <925224566_152@local>
Content-type: text/html

Refer to <a href="http://localhost/.//chap8.htm#978">docbase</a>
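An entry in this format might be produced by a routine along the following lines. The make_nntp_msg( ) routine called by Text( ) is not shown, so this is only a guess at its general shape: format_nntp_msg( ), its named arguments, the placeholder sender address, and the URL layout are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Format one load-file entry (hypothetical routine and argument names;
# the header layout follows Example 9.8).
sub format_nntp_msg {
    my (%arg) = @_;
    my $msg = "From: $arg{from}\n"
            . "Message-ID: <$arg{msg_id}>\n"
            . "Subject: $arg{subject}\n"
            . "Newsgroups: $arg{newsgroup}\n";
    # Top-level (h1) messages carry no References: header.
    $msg .= "References: <$arg{ref_id}>\n" if $arg{ref_id};
    $msg .= "Content-type: text/html\n"
          . "\n"                                   # blank line ends the headers
          . qq{Refer to <a href="$arg{url}">docbase</a>\n}
          . "\n";                                  # blank line between entries
    return $msg;
}

print format_nntp_msg(
    from      => 'editor@example.com',             # placeholder address
    msg_id    => '925224566_152@local',
    ref_id    => '925224566_151@local',
    subject   => "A docbase's Web API",
    newsgroup => 'groupware.v3',
    url       => 'http://localhost/chap8.htm#973', # assumed URL layout
);
```

Returning a string instead of writing to a file keeps the formatter independent of where the entries end up, whether that is a load file or a direct post.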

Why a standalone file of messages? It enables pipelined processing. For example, the first version of this generator was written in Java, and the NNTP loader was written in Perl. When I rebuilt the generator in Perl, there was no need to change the loader. The Perl generator only had to target the same interface—the file format shown in Example 9.8—as the Java version had. The loader itself, shown in Example 9.9, is very simple.

Example 9-9. An NNTP Message Loader

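use Net::NNTP;

my $nntp = Net::NNTP->new('localhost')
    or die "cannot connect to news server";

my @msg = ();

open(F,"nntp_msgs") or die "cannot open nntp_msgs $!";
while (<F>)
  {
  next if !@msg && /^\s*$/;          # skip blank lines between messages
  push (@msg,$_);
  if ( m/^Refer to/ )                # the body line ends each message
    {
    if (! $nntp->post(@msg)) { die "cannot post" }
    @msg = ();
    }
  }
close F;
$nntp->quit;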

The Text( ) routine also takes care of autonumbering figures and listings by trapping elements of these types and inserting formatted numbers into two output streams—the table of contents and the docbase itself. Finally, it emits the characters received from the parser—wrapping a table-of-contents link around header text to create the other half of the table-of-contents/chapter cross-linkage.

The EndTag( ) handler

The EndTag( ) routine (Example 9.10) adds the instrumented comment link to each commentable element. The link’s address encodes the information that the form-generating script passes to its handler, which in turn posts the comment to the news server using Net::NNTP. By the time the parser calls EndTag( ), all this information is available. Before emitting a </p> or </li> tag, it writes a link whose text is just an element number, but whose address is a muscular CGI call that passes the docbase name, chapter number, element number, the NNTP message ID for this element’s governing header, and the complete text of the element for quoting purposes. Finally, EndTag( ) increments the element counter.

Example 9-10. The EndTag( ) Handler

sub EndTag
  {
  my ($expat,$element) = @_;

  if ( withinCommentableElement($expat,$element) )
    {
    print DOCBASE $_;
    return;
    }

  if ( isCommentableElement($element) )                # need to add comment link
    {
    my $escaped_current_header =                       # escape current header
      escape ($current_header);

    my $encoded_chars = escape($comment_chars);        # escape current element

    if ($protocol eq 'mail')                           # email version
      {
      $comment_chars = "Chapter: $counters->{chapter}, Section: "
           . "$escaped_current_header, Para $counters->{element}: [$comment_chars]";
      print DOCBASE "\n<span class=\"eltnum\">\n"
           . '<a href="mailto:[email protected]'
           . "?subject=groupware.$version, $escaped_current_header"
           . "&body=$encoded_chars\">"
           . $counters->{element} . "</a></span>";
      }
    else                                               # nntp version
      {
      $comment_chars = "Chapter: $counters->{chapter}, Section: $current_header, "
           . "Para $counters->{element}<p>$comment_chars";
      print DOCBASE "\n<span class=\"eltnum\">\n"
           . "<a href=\"http://$server/$cgi_path/comment.pl?"
           . "docbase=groupware.$version&chapnum=$counters->{chapter}&"
           . "elt=$counters->{element}&fragment=$encoded_chars&"
           . "id=$lastHdrId\">" . $counters->{element} . "</a></span>";
      }
    }

  if ( my $type = isFigureOrListingOrTable($element) ) # update table-of-contents
    {
    my $tocElt = "<a name=\"$counters->{element}\">"
         . "<a href=\"chap$counters->{chapter}.htm#$counters->{element}\" target=\"chap\"><li>\n"
         . " $type $counters->{chapter}-$counters->{$type}: $current_figttl</li></a>\n";
    }

  print DOCBASE $_;                                    # emit tag

  if ( isPreformattedElement ($element) )              # CSS workaround
    {   print DOCBASE "</pre>"; }

  if ( isCommentableElement ($element) )               # update element counter
    {   $counters->{element}++; }

  }
                        
                        
                        