Transforming an XML Repository into Reviewable Web Pages

I began writing this book in HTML, then switched midstream to XML. Using terms from Chapter 5, I converted a docbase whose repository format and delivery format were both the same stream of HTML into a docbase whose repository format is XML and whose delivery format is HTML. Here are the lessons I learned when I did that.

You Can Easily Convert HTML to Equivalent XML

XML doesn’t have to involve complex document type definitions (DTDs) written in weird syntax that’s hard to understand and use. Of course, there are good reasons to use DTDs, but the inventors of XML wisely chose to make them optional. As a result, the initial conversion of my HTML manuscript to XML was a trivial exercise that took just a few hours. There were just three rules I had to apply:

  • Close all tags.

  • Quote all attributes.

  • Escape ampersands.

I used keystroke macros in my text editor to add end tags to <p>, <li>, and <img> elements. To close an empty tag such as <img>—that is, a tag that has no content other than its attributes—you need only precede the trailing angle bracket with a forward slash, like this:

<img src="fig2.gif"/>

I also used search-and-replace to escape literal ampersands (&), which in XML must be written as &amp;. Ampersands that introduce predefined entities such as &lt; and &gt;, which represent < and >, are already properly escaped; it's the bare, individual ampersands that need attention.
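These three rules are easy to check mechanically. Here's a minimal Python sketch (the book's tooling is Perl, but Python's standard library bundles the same expat parser that underlies Perl's XML::Parser) that tests a fragment for well-formedness:

```python
# Check a fragment for XML well-formedness using the stdlib expat parser.
import xml.parsers.expat

def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(fragment, True)   # True: this is the final chunk
        return True
    except xml.parsers.expat.ExpatError:
        return False

# A bare ampersand breaks well-formedness...
print(is_well_formed('<p>AT&T</p>'))        # False
# ...but the escaped form parses cleanly.
print(is_well_formed('<p>AT&amp;T</p>'))    # True
# An empty tag closed with a trailing slash is fine, too.
print(is_well_formed('<img src="fig2.gif"/>'))   # True
```

The same check catches unclosed tags and unquoted attributes, so a script built around it can sweep a whole docbase for violations of all three rules.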

This XML-ized flavor of HTML now has its own name: XHTML (Extensible HyperText Markup Language). And there are XML editors, including SoftQuad's XMetaL and the W3C's test-bed browser, Amaya, that support XHTML. You can use these to convert HTML to its equivalent XHTML or to write XHTML directly.

XML Means No More Custom Parsing Code

The docbase applications we’ve seen thus far required trivial kinds of parsing. If you only need to extract <meta> tag information from a docbase, as does Docbase::Docbase::getMetadata( ) in Chapter 6, it’s probably overkill to use an XML parser. Simple Perl scripts work well for simple parsing problems. But to add the feedback mechanisms we’re building here requires a deeper and more complete understanding of the docbase’s structure.

If you find yourself writing code to parse complex structures, you should rethink your approach. XML parsers that can do this for you are freely available. These parsers won't write your application for you; it's up to your code to do something useful with the tags, attributes, and text chunks recognized by the parser. But the recognition itself can and should reside in an off-the-shelf component that you plug into your application.
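The division of labor looks like this: the parser fires events for tags, attributes, and text chunks, and your handlers decide what to do with them. A minimal Python sketch using the stdlib expat parser (the same event-driven model that Perl's XML::Parser exposes):

```python
# The parser does the recognition; application handlers receive events
# for start tags (with attributes), end tags, and text chunks.
import xml.parsers.expat

events = []
parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler  = lambda name, attrs: events.append(('start', name, attrs))
parser.EndElementHandler    = lambda name: events.append(('end', name))
parser.CharacterDataHandler = lambda text: events.append(('text', text))

parser.Parse('<p class="quotation">Close all tags.</p>', True)
for event in events:
    print(event)
```

No parsing code to write or maintain: the application's job reduces to supplying the three handler functions.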

Perl’s XML::Parser Module Is Really Useful

Parsers based on C and Java were available long before Perl’s XML::Parser module was. They worked well, but I wasn’t very productive with these tools. When you add value to an XML repository, what matters most is rapid development and capable text processing. These are of course two of the outstanding qualities of Perl. The advent of a Perl-based XML parser rewrote the productivity equation for me.

XML and HTML Can Fruitfully Coincide

One of the exciting things about XML is that you can invent your own tags. So you can, for example, replace HTML’s meager set of six header tags—<h1> through <h6>—with meaningful names like <section>, <chapter>, and <figure>. But just because you can doesn’t mean you have to. In an HTML-oriented world, you’ll need to convert these logical constructs into the kinds of tags that browsers can render. Standards that address this problem include Document Style Semantics and Specification Language (DSSSL) and—the current favorite—Extensible Stylesheet Language (XSL). Even when tools that work with these standards become widespread and standard, they might be overkill for simple applications. Consider this approach:

<h1 class="chapter">Conferencing Applications</h1>
<p class="figure-title">Figure 9-1: ....</p>

These are HTML constructs augmented with CSS attributes. But if the docbase containing these constructs passes muster when you run it through an XML parser—as it will if you close all tags, quote all attributes, and escape ampersands—then it is also by definition an XML docbase. In this situation, you can merge the repository and delivery formats or diverge them.

In the merged case, the “XMLness” of the docbase ensures that you can search it or transform it, while its “HTMLness” means that browsers need no XML-to-HTML translation in order to render the docbase. CSS tags that provide purely structural markers when you work with the XML aspect of the docbase double as stylistic markers when you work with its HTML aspect.

In the diverged case, the “XMLness” of the docbase is a springboard from which to launch enhanced renderings. These can retain the “HTMLness” (and “CSSness”) of the repository while adding the kinds of features that should be derived from the docbase rather than encoded in it.

For example, it was straightforward to generate a table of contents from this book’s repository and to cross-link the headers in the generated table of contents with the headers in the generated chapters. It was also straightforward to number the chapters and figures during the translation from repository to delivery format. Figure 9-5 shows the table-of-contents view of this book side by side with a piece of a chapter.


Figure 9-5. Generated table of contents

The two panes are linked in both directions. Clicking a link in the table-of-contents pane synchronizes the chapter pane to the indicated chapter and section. Clicking a chapter heading or subheading does the same thing to the table-of-contents pane. A script can do this cross-linking because even though the book’s source is “only” HTML/CSS, it is also XML, structured so that it’s easy to pick out chapter headings, listings, and figures. The same code uses this information again to build the newsgroup outline, which is just another view of the table of contents shown in the left pane of Figure 9-5.
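The table-of-contents generation is simple precisely because the source is well-formed. A Python sketch (the book's scripts are Perl; the markup here follows the book's <h1 class="chapter"> convention, and the chapter titles are illustrative):

```python
# Extract and number chapter headings from an XML-ized HTML repository
# to build a table of contents.
import xml.etree.ElementTree as ET

chapter_stream = """
<GroupwareBook>
  <h1 class="chapter">Conferencing Applications</h1>
  <p>...</p>
  <h1 class="chapter">Groupware Docbases</h1>
</GroupwareBook>
"""

root = ET.fromstring(chapter_stream)
# Chapter heads are exactly the h1 elements whose class is "chapter".
toc = [h1.text for h1 in root.iter('h1') if h1.get('class') == 'chapter']
for number, title in enumerate(toc, 1):
    print('Chapter %d: %s' % (number, title))
```

Numbering happens here, at translation time, rather than in the repository, which is why renumbering never requires touching the source.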

A Transitional Approach to XML Authoring Has Near-term Value

This way of combining HTML, CSS, and XML is a transitional strategy. I hope it won’t be needed once browsers that render XML directly (subject to CSS or XSL styles) have become widespread and standard, along with tools that help us write XML. But even in “Internet time” these developments sometimes take longer than we’d like. Meanwhile it can be useful to leverage the ability of today’s browsers to directly render XML-ized HTML.

In order to build a reviewable docbase from XML sources, somebody has to produce those XML sources. In the case of this book, that somebody was me. As I’ve already mentioned, it was a short hop from the HTML stream I began with to the XML stream I switched to midway through the book. Of course, I’ve been writing in tagged-text formats for years. People who haven’t, and who depend on WYSIWYG HTML text editors, can use one of the emerging breed of XHTML editors.

Remember, too, that you can sometimes control docbase inputs using an XML-ized input template. Because the Docbase system can do that, it’s an example of an XML authoring application that can reliably produce well-formed and valid XML, using the Docbase::Input module we saw in Chapter 6.

DTDs Don’t Have to Be Complex

Even though it’s easy for me to write well-formed XML using just a text editor, what I produce this way isn’t always valid XML. What’s the difference between well-formed XML and valid XML? Well-formed XML merely conforms to syntactic rules: all tags properly closed, all attributes quoted, all ampersands escaped. Valid XML is well-formed XML that also conforms to a DTD that spells out the elements and structures that can appear in a document. Example 9-1 shows the DTD that I used for this book.

Example 9-1. An HTML/CSS-oriented Document Type Definition

<!DOCTYPE GroupwareBook [

 <!ELEMENT GroupwareBook
  (h1 | h2 | h3 | h4 | h5 | p | ul | ol | table | div)+>

 <!ELEMENT table   (tr | th)+>
 <!ELEMENT tr      (td)+>
 <!ELEMENT th      (td)+>
 <!ELEMENT ul      (ul | li | a)+>
 <!ELEMENT ol      (ol | li | a)+>
 <!ELEMENT img     EMPTY>
 <!ELEMENT a       (#PCDATA)>
 <!ELEMENT hr      EMPTY>
 <!ELEMENT td      (#PCDATA | img | a | ul | br | b | u |
   i | center | p | blockquote | tt | hr)*>
 <!ELEMENT p       (#PCDATA | i | u | b | span | center | a | tt)*>
 <!ELEMENT li      (#PCDATA | a | p | b | span | tt | i)*>
 <!ELEMENT h1      (#PCDATA | a)*>
 <!ELEMENT h2      (#PCDATA | a | tt)*>
 <!ELEMENT h3      (#PCDATA | a)*>
 <!ELEMENT h4      (#PCDATA | a)*>
 <!ELEMENT h5      (#PCDATA | a)*>
 <!ELEMENT i       (#PCDATA | a | p)*>
 <!ELEMENT center  (#PCDATA | p | a | table)*>
 <!ELEMENT br      (#PCDATA | a)*>
 <!ELEMENT tt      (#PCDATA | a)*>
 <!ELEMENT u       (#PCDATA | a)*>
 <!ELEMENT blockquote (#PCDATA | a | i)*>
 <!ELEMENT b       (#PCDATA | a)*>
 <!ELEMENT div     (#PCDATA | p | a | b | table | ul)*>
 <!ELEMENT span    (#PCDATA | a)*>

 <!ATTLIST h1
     class (chapter) #REQUIRED>

 <!ATTLIST p
     class ( listing-title   |
             figure-title    |
             table-title     |
             footnote        |
             quotation       |
             UsageTipTitle   ) #IMPLIED>

 <!ATTLIST a
     href  CDATA #IMPLIED
     name  CDATA #IMPLIED>

 <!ATTLIST hr
     width CDATA #IMPLIED>

 <!ATTLIST table
     class         (inline-image) #IMPLIED
     border        CDATA #IMPLIED
     cellspacing   CDATA #IMPLIED
     cellpadding   CDATA #IMPLIED
     cols          CDATA #IMPLIED
     width         CDATA #IMPLIED>

 <!ATTLIST td
     align         CDATA #IMPLIED
     valign        CDATA #IMPLIED
     colspan       CDATA #IMPLIED>

 <!ATTLIST div
     class     ( inline-code     |
                 inline-listing  |
                 inline-table    )   #REQUIRED
     align     CDATA #IMPLIED>

 <!ATTLIST img
     src      CDATA #REQUIRED>

 <!ATTLIST span
     class ( footnote )  #REQUIRED>

]>


This DTD specifies the (quite small) subset of HTML that I used in this book. Because I’d already written part of the book when I switched to XML, the DTD wasn’t a prescription but rather a codification of the HTML idioms that I had already chosen to use. This seemingly backward approach might horrify an SGML purist, but it should delight an XML zealot. The whole raison d’être of XML, after all, is to bridge two very different cultures: the primordial tag soup of HTML and the structural rigor of SGML. Even though I’m highly attuned to structured-text disciplines, I had always been intimidated by the thought of creating a DTD from scratch. In practice it wasn’t a bad idea to just start writing, using whatever HTML idioms came easily to hand, and then discover and refine the DTD that was implicit in my usage of HTML.

The DTD’s DOCTYPE is GroupwareBook. That means a document that is an instance of this DTD will have this structure:

<GroupwareBook>
... the book ...
</GroupwareBook>

The first ELEMENT declaration lists the top-level HTML tags that can appear within the outermost tag pair. The remaining ELEMENT declarations list each of the HTML tags in the book and define the elements (if any) that can nest within them.

The structure defined by this DTD is very flat, nothing like the rich structures defined by SGML DTDs such as DocBook and the Text Encoding Initiative (TEI). These DTDs are excellent tools, and you should certainly consider using them when starting a major new documentation project. But what if you’re sitting on a pile of existing HTML and have users who are writing more of it every day? A simple, flat DTD such as this one can be the path of least resistance to a version of the HTML that you can manage using parser-based software.

The ATTLIST declarations enumerate each attribute that’s used in the book. Again there’s nothing complicated here, though the DTD does enforce some simple rules. For example, the <div> tag’s class attribute is required. Further, the value of that attribute must be one of the listed options.

Using a validating parser

Perl’s XML::Parser is not a validating XML parser. It checks only for well-formedness, not for conformance to a DTD. But there are plenty of validating parsers around. I started with DataChannel’s DXP, a Java-based validating parser that has now evolved into XJ Parser, a freely available tool jointly developed by Microsoft and DataChannel. More recently I’ve used MSXML, a component that’s built into Internet Explorer. I use these validating parsers to keep my book’s XML sources in line with its DTD and use XML::Parser for the script that transforms those well-formed and valid XML files into a reviewable docbase.

Why does validity matter? Well-formed XML isn’t always what you think it is. For example, the DXP parser showed me at one point that I had a <div> tag in Chapter 2 whose matching </div> tag didn’t occur until Chapter 7. Clearly I had intended to write a matched tag pair in each chapter. To a nonvalidating parser, though, two wrongs can make a right. Without consulting a DTD, it can’t know that a <div> tag—as I meant to use it—shouldn’t contain whole chapters or subheads. Consider this fragment that omits a </div>:

<!-- </div> This tag accidentally omitted -->

<h3>Groups need privacy too</h3>

Here’s how the MSXML parser handles that situation:

Element content is invalid according to the DTD/Schema.
Expecting: #PCDATA, p, a, b, table, ul.
 845 5 <h3>Groups need privacy too</h3>

The <h3> is invalid because, absent a </div>, the still-open <div> would contain an h3—which the DTD says it cannot. When the parser sees an element that the DTD won’t permit in the current context, it tells you what would have been allowed. It flags the line and column of the erroneous element and echoes the text on that line.
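A nonvalidating parser, by contrast, is satisfied as long as the tags balance. A Python sketch (using the stdlib ElementTree parser, which—like Perl's XML::Parser—checks well-formedness only) shows that a listing div swallowing an <h3> parses without complaint, even though the DTD forbids it:

```python
# Well-formed but invalid: the tags balance, so a nonvalidating parser
# accepts a div that contains an h3, which the DTD says it cannot.
import xml.etree.ElementTree as ET

fragment = """
<div class="inline-listing">
  <p>a listing</p>
  <h3>Groups need privacy too</h3>
</div>
"""

# No exception is raised here; only a validating parser, consulting the
# DTD, would flag the h3 inside the div.
root = ET.fromstring(fragment)
print(root.find('h3').text)
```

This is exactly why the Chapter 2 problem went unnoticed until a validating parser looked at the book: to a well-formedness check, a </div> five chapters late is just a bigger div.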

Although you need to install Internet Explorer to get MSXML, you don’t need to use it within the browser. You can also run the parser from the command line on Windows 95/98/NT, as Example 9-2 shows.

Example 9-2. Using the MSXML Parser from the Windows Command Line

var doc = new ActiveXObject("microsoft.xmldom");
doc.async = false;
doc.validateOnParse = true;
doc.load(WScript.Arguments(0));
if (doc.parseError.errorCode != 0)
        WScript.Echo(doc.parseError.reason, doc.parseError.line,
                doc.parseError.linepos, doc.parseError.srcText);

If this code lives in a file called validate.js, you can use the Windows Scripting Host to run it from the Win95/98 or NT command line, passing the file to validate as an argument:

cscript validate.js file.xml

Managing document structure: declarative versus procedural methods

If you’re retrofitting a DTD to an HTML stream, as I was, you can use a validating parser to incrementally develop the DTD. When you start from scratch, no element is allowed until you declare it and define the context in which it can appear. Should <h1 class="chapter"> be valid inside a <div> element? In standard HTML it is. In my DTD it isn’t, because I reserved <div> exclusively as a container for figures and listings. Neither is plain <h1> valid. I reserve it exclusively for chapter heads, and its ATTLIST declaration in the DTD requires a class attribute.

Example 9-1 is full of compromises. A pure XML approach might use <figure-title> rather than <div class="figure-title">. In that case, the DTD could enforce a more complete definition of a figure, for example:

<!ELEMENT figure (figure-title, figure-body, figure-caption)>

But how are you going to display this construct? For the current installed base of browsers, you’d have to write parser-based code to translate these elements into HTML, possibly augmented with CSS. Both 5.x browsers can now associate XML constructs directly with CSS styles, but at the time of writing (summer 1999) this remains an experimental capability. When it matures, and when these browsers substantially displace the 3.x/4.x browsers, a pure XML approach will become possible. Until then, a hybrid HTML/CSS/XML strategy can help you weather the transition.

There are trade-offs, to be sure. In the hybrid case, the DTD can’t enforce the previous definition of a figure. At best, it can enforce an enumeration of the attributes used with the <div> element and constrain where that element may appear and what it may contain. It’s possible to enforce a richer definition of a figure. But you can’t do it declaratively with the DTD. You have to do it procedurally, in a parser-based application.
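What does procedural enforcement look like? A Python sketch of such a check (the class names and the rule itself—every figure div must begin with a figure title—are illustrative, not the book's exact conventions):

```python
# Enforce, in application code, a structural rule the hybrid DTD cannot
# express: every <div class="inline-figure"> must start with a
# <p class="figure-title"> child.
import xml.etree.ElementTree as ET

def check_figures(doc):
    """Return a list of rule violations found in the document."""
    errors = []
    for div in ET.fromstring(doc).iter('div'):
        if div.get('class') != 'inline-figure':
            continue
        children = list(div)
        first = children[0] if children else None
        if first is None or first.tag != 'p' or first.get('class') != 'figure-title':
            errors.append('figure div lacks a leading figure-title')
    return errors

good = ('<body><div class="inline-figure">'
        '<p class="figure-title">Figure 9-1</p><img src="fig1.gif"/>'
        '</div></body>')
bad = '<body><div class="inline-figure"><img src="fig1.gif"/></div></body>'
print(check_figures(good))   # []
print(check_figures(bad))    # ['figure div lacks a leading figure-title']
```

A check like this slots naturally into the same parser-based script that generates the delivery format, so structural errors surface every time the docbase is rebuilt.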

Which approach is best? Neither is inherently right or wrong. This book isn’t a particularly complex document, so it made sense to trade structure-declaring power for rendering convenience. This trade-off can also make sense for lots of routine business documents, such as those you can create and manage using the Docbase system we explored in Chapter 6 and Chapter 7. If I were writing a Boeing 777 manual, on the other hand, I’d want to trade rendering convenience for all the declarative power that the DTD could possibly provide. What’s great about XML is that you can locate an application of it anywhere along a continuum. Some applications profit by adding a bit of rigor to HTML, some by using the maximum amount of rigor that XML can provide. XML embraces both approaches and everything in between.

Final Observations on the Transitional HTML/CSS/XML Approach

Here are some final points to keep in mind about the transitional strategy I’ve sketched here.

You can idealize the installed base

The CSS-oriented approach aims squarely at the installed base of 4.x browsers. In that context, the XML-to-HTML translation affords an opportunity to smooth over differences between the Microsoft and Netscape implementations of CSS1. For example, I tagged listings using <div class="inline-listing"> and defined this CSS style:

div.inline-listing {
        font-family: courier;
        font-size: 10pt;
        white-space: pre;  /* MSIE doesn't do this */
        }

But it only works in Communicator. MSIE doesn’t honor the request to treat white-space in a preformatted fashion.[10] One solution would have been to use the <pre> tag in the HTML/XML source, making both browsers preserve white space in listings. But I didn’t want a temporary quirk of one browser to force me to use a soon-to-be-obsolete tag, and I didn’t want to have to revisit the XML source and alter those tags when the quirk eventually went away. So instead I tweaked the HTML generator to wrap a <pre>..</pre> tag pair around <div class="inline-listing">..</div>. This isn’t a perfect solution. But it enables me to mark up the text as if white-space: pre were supported by both browsers. Once it is, I can remove the <pre>..</pre> from the processing script and rely on CSS.
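A sketch of that generator tweak in Python (the book's actual generator is a Perl script; the exact tag strings, and the assumption that listing divs are written exactly as shown, are mine):

```python
# During repository-to-delivery translation, wrap <pre>..</pre> around
# each inline-listing div so that MSIE, which ignores white-space: pre,
# still preserves the listing's layout.
import re

def wrap_listings(html):
    """Wrap a <pre>..</pre> pair around each inline-listing div."""
    out, depth, in_listing = [], 0, False
    # Split the stream into tags and text chunks, keeping the tags.
    for token in re.split(r'(<[^>]+>)', html):
        if token == '<div class="inline-listing">' and not in_listing:
            in_listing, depth = True, 0
            out.append('<pre>')
            out.append(token)
            continue
        if in_listing and token.startswith('<div'):
            depth += 1          # track nested divs inside the listing
        if in_listing and token == '</div>':
            if depth == 0:      # this close ends the listing div
                out.append(token)
                out.append('</pre>')
                in_listing = False
                continue
            depth -= 1
        out.append(token)
    return ''.join(out)

print(wrap_listings('<div class="inline-listing"><tt>x</tt></div>'))
# <pre><div class="inline-listing"><tt>x</tt></div></pre>
```

When white-space: pre works everywhere, this function simply drops out of the pipeline; the XML source never changes.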

XML-ized HTML is a lot more valuable than plain HTML

No one yet knows for sure how XML rendering is ultimately going to work. By the summer of 1999, browser support for XSL and document object models was a work in progress. XML parsers, however, were widely available. No matter which way the wind blows, you’ll be better prepared if your documents are well formed and valid. Think of it as insurance. Even if in the short term you rely on HTML/CSS-based presentation services, you’ll know that you can mechanically transform content—if it’s well-formed and valid XML—when it comes time to exploit next-generation technologies.

DTDs can’t do everything

Tim Bray, coeditor of the XML specification, points out that there are always going to be limits to the structural rules that you can enforce declaratively using a DTD. For example, a Docbase application might want to constrain the set of values allowed in a <meta> tag, such as <META NAME="company" CONTENT="Microsoft">, based on a database lookup. Clearly you won’t want to enumerate an entire database column in a DTD declaration. A parser-based application is going to have to do the lookup in order to validate this kind of document.
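That lookup-based check has to live in application code. A Python sketch (the allowed-companies set stands in for the database column a real Docbase application would consult, and "Initech" is a made-up offending value; note too that XML tag names are case-sensitive, so the sketch uses lowercase <meta>):

```python
# Validate meta company values against a "database" of allowed names,
# a rule no DTD enumeration could practically express.
import xml.etree.ElementTree as ET

# Stand-in for a database lookup of the allowed company column.
ALLOWED_COMPANIES = {'Microsoft', 'Netscape', 'DataChannel'}

def invalid_companies(doc):
    """Return meta company values not found in the allowed set."""
    bad = []
    for meta in ET.fromstring(doc).iter('meta'):
        if (meta.get('name') == 'company'
                and meta.get('content') not in ALLOWED_COMPANIES):
            bad.append(meta.get('content'))
    return bad

doc = ('<head><meta name="company" content="Microsoft"/>'
       '<meta name="company" content="Initech"/></head>')
print(invalid_companies(doc))   # ['Initech']
```

The DTD still guarantees that the <meta> tags are structurally sound; the application layers the data-dependent rule on top.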

[10] In fairness, despite this glitch, MSIE is on the whole a far better implementation of CSS than Communicator. Netscape embraced CSS late and reluctantly and has been playing catch-up.
