Storing Docbase Records

In the Docbase system described here and in Chapter 7, the repository format coincides with the delivery format. The same set of HTML pages serves both purposes. This differs from the BYTE Magazine docbase we explored in Chapter 5; in that case, a translator read the repository format and wrote deliverable pages.

Why use one versus the other? The BYTE docbase had a fairly complex format but was batch-oriented and maintained by a single production expert who exported material from QuarkXPress and then massaged it to meet a detailed repository specification. There was no need to preview individual pages or validate input interactively, and although a tool could have provided these features, it would have been costly to build and maintain.

The Virtual Press Room, by contrast, had a relatively simple format and was built interactively by many untrained users. These users required an authoring tool that did validate and preview the information they supplied. Because the format was simple, that tool was cheap to build and maintain. Since the preview pages had to be produced immediately, it was convenient to just store them as is.

In the Docbase system, the deliverable HTML pages can be XML pages too, if you format the docbase template as XML. When the two formats coincide, HTML pages are much more manageable than they otherwise would be. The Virtual Press Room was, like the BYTE docbase, a pre-XML-era invention. Had I built it in 1999 rather than 1995, I’d have exploited XML, as I’ll demonstrate here.

When a user enters a new record in the ProductAnalysis docbase, Docbase::Input( ) interpolates the validated input into the same docbase template used for the preview. That template might be plain HTML, perhaps augmented with CSS styling. But as Example 6.8 shows, it can also conform to the rules for well-formed XML. As we’ll see again in Chapter 9, those rules are minimal. In this case, they simply require that all tags must be closed and all attributes quoted.

Example 6-8. An HTML/XML Docbase Record Template

<html>
<head>
<meta name="company" content="[company]"/>
<meta name="product" content="[product]"/>
<meta name="analyst" content="[analyst]"/>
<meta name="duedate" content="[duedate]"/>
<title>[company], [product], [title]</title>
<link rel="stylesheet" type="text/css" href="../../../Docbase/ProductAnalysis/style.css"/>
</head>
<body>

<!-- navcontrols --> <!-- navigation controls go here -->

<h1>[company] / [product]</h1>

<table border="1" cellpadding="4">

<tr>
<td align="right" valign="top" class="label">Date</td>
<td align="left" class="duedate">[duedate]</td>
</tr>

<tr>
<td align="right" valign="top" class="label">Analyst</td>
<td align="left" class="analyst">[analyst]</td>
</tr>

<tr>
<td align="right" valign="top" class="label">Title</td>
<td align="left" class="title">[title]</td>
</tr>

<tr>
<td align="right" valign="top" class="label">Summary</td>
<td align="left" class="summary">[summary]</td>
</tr>

<tr>
<td align="right" valign="top" class="label">Full Report</td>
<td align="left" class="fulltext">[fulltext]</td>
</tr>

<tr>
<td align="right" valign="top" class="label">Contact Info</td>
<td align="left" class="contact">[contact]</td>
</tr>

</table>

</body>
</html>


This template is used twice—first to create the preview, as we’ve already seen, and again to create the final record stored in the docbase.
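To make the interpolation step concrete, here is a minimal sketch of how bracketed placeholders such as [company] might be filled from a hash of form values. It is not the Docbase module's own code: the names fill_template and xml_escape, and the sample values, are hypothetical. The escaping step is also an assumption, added so that user input can't break the well-formedness of the stored record; a field meant to carry markup would need different handling.

```perl
use strict;
use warnings;

# Escape the characters that would break well-formed XML.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # ampersand first, so later entities survive
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Replace each [name] placeholder with the matching form value.
# Placeholders with no matching key are left untouched.
sub fill_template {
    my ($template, $vars) = @_;
    $template =~ s{\[(\w+)\]}
                  { exists $vars->{$1} ? xml_escape($vars->{$1}) : "[$1]" }ge;
    return $template;
}

# A three-line excerpt of a record template (sample values are made up).
my $template = <<'END';
<meta name="company" content="[company]"/>
<title>[company], [product], [title]</title>
<td align="left" class="summary">[summary]</td>
END

my %vars = (
    company => 'Acme',
    product => 'Directory Server',
    title   => 'First look',
    summary => 'Fast & "simple" LDAP server',
);

print fill_template($template, \%vars);
```

Because the same routine can serve both the preview and the stored record, the two can't drift apart.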

The combination of HTML, CSS, and XML shown here is a transitional strategy. You could, instead, write a pure XML template like this:


The problem with this approach is that, for most browsers, you’ll end up with a repository format that doesn’t coincide with a delivery format. Internet Explorer 5.0 can associate XML tags with CSS or extensible stylesheet language (XSL) styles and thus render a page of XML as it would render a page of HTML. So can the beta version of Navigator 5.0. But this is a new capability that’s not yet universally deployed and won’t be for a while. So in practice, to support the installed base of browsers, you’d need another step to translate between repository and delivery formats.

The middle-ground approach shown in Example 6.8, which we’ll see again in Chapter 9, makes ordinary CSS class attributes do double duty. In the presence of a CSS style sheet, these attributes exert stylistic control over the docbase record. That control can be as detailed as your tagging will support—you could even assign a unique style to each field of the record. What’s more, styles obey inheritance rules, so styles assigned to a class attached to the <body> tag, or to a <table> tag, will ripple down through these structures unless explicitly overridden at lower levels. Well, in theory that’s what happens. In practice neither the Netscape nor the Microsoft browser currently implements all of CSS1, and you’ll run into the usual headaches when you try to figure out which features, and combinations of features, work reliably in both.

Another Use for CSS Tags

Flaky CSS implementations don’t detract at all from another role played by the class attribute. It is, fundamentally, a selector that operates on a document and returns a subset of its elements. Normally it’s a CSS-aware application (e.g., your browser) that does the selection in order to apply a style. But any other application can use the selectors too. Suppose you want to create a view of the docbase that presents report summaries containing a search term. In SQL terms, you’d like to issue the query:

select summary from docbase where summary like '%LDAP%'

Example 6.9 demonstrates a filter, called xml-grep, that reads one of the HTML/CSS/XML files in this docbase and performs the same query.

Example 6-9. A Docbase Query Based on CSS Tags

#   usage: xml-grep FILENAME TAG PATTERN
# example: xml-grep 000127.htm summary LDAP

use XML::Parser;

my $xml = new XML::Parser (Style => 'Stream');  # Stream style calls the subs below

$xml->parsefile($ARGV[0]);         # parse the file

sub StartTag {}                    # not needed here

sub EndTag {}                      # not needed here

sub Text                           # called with the current text in $_
  {
  my $expat = shift;
  if  (
      $expat->current_element eq 'td'        and  # table cell
      $_{class}               eq $ARGV[1]    and  # of class 'summary'
      m/$ARGV[2]/                                 # matching LDAP
      )
    {   print "$_{class}: $_\n";    }             # found a hit
  }

This script expects three arguments: a filename, a class attribute, and a search string. It’s a whole lot slower than grep. But it’s more flexible, because it will match, for example, either of these patterns:

<td class="summary" width="20%">...</td>

<td valign="top" class="summary" align="left" colspan="2">...</td>

What’s more, this approach can deal with inheritance in the same way that CSS display processors do. For example, the analyst field might not always be immediately contained within a cell of an HTML table. Suppose that inside that cell, the name is wrapped up in link syntax, like this:

<td class="analyst"><a href="mailto:[email protected]">
Jon Udell</a></td>

We can still capture my name like this:

if  (
    $expat->within_element('td')                and  # inside a table cell
    $last_seen_class_attr        eq $ARGV[1]    and  # class="analyst"
    m/$ARGV[2]/                                      # match "Jon Udell"
    )

If we saved the value of the last-seen class attribute as $last_seen_class_attr, then this fragment, which runs in the context of the <a href> tag, will succeed. A line-oriented grep can’t do this. But an XML query that understands the hierarchy of an attributed docbase can find things that are nested in other things. Several formal query languages have been proposed for XML, notably XQL and XML-QL. Even without a general-purpose XML query language, though, you can see that it’s not hard to write parser-enabled code to do simple queries.
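The fragment above assumes that something has been remembering class attributes as the parse proceeds. Here is one way that bookkeeping might look, as a sketch rather than a finished tool. Instead of a single $last_seen_class_attr it keeps a stack (the hypothetical @class_stack), so that an element without its own class inherits its parent's, and the right class is back in effect after a nested element closes. The sample record, the example.com address, and the hardcoded class and pattern are stand-ins for real input.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical stand-ins for $ARGV[1] and $ARGV[2] in xml-grep.
my ($want_class, $pattern) = ('analyst', 'Udell');

my @class_stack;    # class in effect for each currently open element
my @hits;           # text chunks that satisfied the query

sub StartTag {
    my ($expat, $tag) = @_;
    # In Stream style, %_ holds the attributes of the tag just opened.
    my $class = defined $_{class} ? $_{class}          # explicit class
              : @class_stack      ? $class_stack[-1]   # inherited from parent
              :                     '';                # none in effect
    push @class_stack, $class;
}

sub EndTag { pop @class_stack }

sub Text {
    my $expat = shift;                                 # text arrives in $_
    return unless $expat->within_element('td');        # anywhere inside a cell
    return unless @class_stack and $class_stack[-1] eq $want_class;
    push @hits, $_ if m/$pattern/;
}

# A made-up record fragment with the analyst's name nested in a link.
my $record = <<'END';
<table><tr>
<td class="label">Analyst</td>
<td class="analyst"><a href="mailto:analyst@example.com">Jon Udell</a></td>
</tr></table>
END

XML::Parser->new(Style => 'Stream')->parse($record);
print "$want_class: $_\n" for @hits;
```

The stack costs a few extra lines, but it mirrors the way CSS inheritance works: the innermost explicit class wins, and everything beneath it inherits.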

Transforming Docbase Records

The XML nature of the docbase records created by the template in Example 6.8 solves another important problem too. When I managed the Virtual Press Room, I sometimes had to make wholesale changes to the docbase. That was never a problem with the BYTE docbase, because its “object code” was routinely “compiled” from its “source code.” But the VPR’s “object code” was its “source code,” and there was no “compiler” in the same sense.

Because the VPR’s HTML pages were machine written, they exhibited regular patterns that Perl scripts could latch onto and use to make systematic transformations. But the pages weren’t trivially rewritable. Creating those scripts was feasible but was a time-consuming and ultimately wasteful exercise. XML means never having to waste your time writing custom parsing code.

Docbases need to evolve. Inevitably you’ll run into situations that require wholesale rewriting of a set of records. The XML discipline makes that kind of rewrite vastly simpler than it otherwise would be. That’s a huge bonus for a manager of semistructured information.

Using HTML <meta> Tags

HTML’s <meta> tag has for years provided a way to make the header of a web page behave much like the header of an email or news message. You can use the <meta> tag to tuck a set of name/value pairs into a document header. In the long run, XML may make this way of maintaining a structured header inside a web page obsolete. But for the near future, it’s a really useful technique. Like email headers, these kinds of web-page headers are easy to parse and manipulate, using a variety of tools. Because the <meta> tags in Example 6.1 are well-formed XML, any XML parser can work with them. But as we’ll see in the next chapter, sometimes that can be overkill. It’s faster and easier to deal with a simple pattern like this one using Perl’s native regular-expression engine.
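As a rough illustration of the regular-expression route, the following sketch pulls the name/value pairs out of a meta-tagged header. It assumes the tags are written exactly as the record template writes them, with name before content and both attributes double-quoted; it is not a general HTML parser, and the function name meta_fields and the sample values are hypothetical.

```perl
use strict;
use warnings;

# Return a hash of name => content pairs from machine-written <meta> tags.
# Assumes the attribute order and quoting used by the record template.
sub meta_fields {
    my ($html) = @_;
    my %field;
    while ($html =~ m{<meta\s+name="([^"]+)"\s+content="([^"]*)"\s*/?>}g) {
        $field{$1} = $2;
    }
    return %field;
}

# A made-up header in the style of a stored docbase record.
my $header = <<'END';
<meta name="company" content="Acme"/>
<meta name="product" content="Directory Server"/>
<meta name="analyst" content="Jon Udell"/>
<meta name="duedate" content="1999-03-01"/>
END

my %field = meta_fields($header);
print "$_: $field{$_}\n" for sort keys %field;
```

This works only because the pages are machine written and therefore regular. Hand-edited pages with reordered or single-quoted attributes would call for a real parser.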

In Chapter 7, we’ll use the meta-tagged header in the docbase record to build indexes that enable several modes of navigation. In Chapter 8, we’ll see how full-text indexers can automatically recognize the meta-tagged header and use it to support field-level as well as full-text search of the docbase. That’s a powerful capability, but one that’s seldom used. Why? It requires a tagging discipline that many web archives lack. By doing that tagging automatically, the Docbase system creates potential value. A smart navigational system is one way to actualize that potential; a smart search system is another.

Note that some of the fields defined in Example 6.8 with <meta> tags duplicate fields governed by CSS class attributes. Why do it both ways? Sometimes you just need to scan for indexable fields, as we’ll be doing in the next chapter, and then it’s handy to have a nice neat header tucked into the top of every docbase record. Sometimes you need to do a wholesale transformation of the docbase, in which case you’ll want to deal with XML elements rather than simple text patterns. There’s more than one way to do it!

Mechanics of Docbase Record Storage

Now let’s see how a record, having been previewed and submitted by a user, enters the docbase. The preview is hardwired to a common script, shown in Example 6.10.

Example 6-10. The Script

#!/usr/bin/perl -w

use strict;
use TinyCGI;
my $tc = TinyCGI->new();
print $tc->printHeader;
my $vars = $tc->readParse();

use Docbase::Docbase;
my $db = Docbase::Docbase->new($vars->{app});

use Docbase::Input;
my $di = Docbase::Input->new($db);
$di->writeDoc($vars);              # pass the CGI variables to writeDoc( )


It’s brief, needing only to pass a hashtable of CGI variables to the Docbase::Input method writeDoc( ), shown in Example 6.11.

Example 6-11. The writeDoc Method

sub writeDoc
  {
  my ($self,$vars) = @_;
  my $app = $self->{app};
  my $cgi_absolute = $self->{docbase_cgi_absolute};
  my $web_absolute = $self->{docbase_web_absolute};
  my $db_template =                               # make template name
  my $content .=                                  # interpolate vars into template
  my $docnum =                                    # get next record number
  my $docfile =                                   # make record's filename
  if ( open(F,">$docfile") )
    {
    print F $content;                             # store record
    close F;
    print "<br>Done. Your reference number is $docnum";
    }
  else
    {
    print "<p>cannot open docfile $docfile";
    }
  }

The writeDoc( ) method is also brief. It uses _fillTemplate( ) again to interpolate form variables into the record template, asks for the next available record number, creates a file named for that record number, and writes the record to the file.
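The last three steps can be sketched on their own, independent of the Docbase modules. Everything here is an assumption made for illustration: the helper names next_docnum and store_record, the six-digit .htm naming (borrowed from the 000127.htm example earlier), and the scan-the-directory way of finding the next number. A busy multiuser site would also need locking, because two simultaneous submissions could otherwise claim the same number.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use List::Util qw(max);

# Find the highest NNNNNN.htm in the directory and return the next number.
sub next_docnum {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "cannot open directory $dir: $!";
    my @nums = map { /^(\d{6})\.htm$/ ? $1 : () } readdir $dh;
    closedir $dh;
    return sprintf '%06d', max(0, @nums) + 1;
}

# Write the filled-in template to a file named for its record number.
sub store_record {
    my ($dir, $content) = @_;
    my $docnum  = next_docnum($dir);
    my $docfile = "$dir/$docnum.htm";
    open(my $fh, '>', $docfile) or die "cannot open docfile $docfile: $!";
    print {$fh} $content;
    close $fh or die "cannot close docfile $docfile: $!";
    return $docnum;
}

# Demonstrate with a scratch directory that is removed on exit.
my $dir = tempdir(CLEANUP => 1);
my $first  = store_record($dir, "<html><body>first record</body></html>\n");
my $second = store_record($dir, "<html><body>second record</body></html>\n");
print "Your reference numbers are $first and $second\n";
```

Unlike writeDoc( ), which reports a failed open to the browser, this sketch simply dies. In a CGI setting you would trap that and print a message instead.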
