Implementation Notes

The following sections describe the steps taken to build the product-manual production system.

Converting the Hard-Copy Manual to XML

Translating a printed document into XML is a lot more involved than just retyping it into a word processor. XML documents are meant to capture not only the content of a document, but also information about its structure. A typical document in a word processor might have paragraph styles, heading styles, and so on that affect how the text is displayed. This is fine if the only goal is to produce good-looking output for people to read.

A well-designed XML document, however, contains much more information than can be conveyed by simple formatting. For instance, take the following note:

Note

XML documents are composed of both structure and content.


On the page, this note appears as a gray box with black text, with a small white box in the upper-left corner that contains the header. This formatting serves to set the note apart from the rest of the text, but it conveys no information about the purpose of the formatted text. If we were to convert this book into XML, however, the note above would be coded as a special XML tag:

<note>XML documents are composed of both structure and content.</note>

This XML fragment now conveys the actual meaning of the text. Although it doesn't indicate how the text should be displayed, it is a relatively simple matter to transform the original XML document into an appropriate format for the display device.
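
As a minimal sketch of such a transformation, the XSLT template below renders each <note> element as an HTML block that a style sheet could then draw as a gray box. The class name "note" and the <h4> header are assumptions made for illustration; they are not taken from the stylesheets used in this project.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Render a <note> as a boxed HTML block; the appearance itself
       is left to CSS rules for the (hypothetical) "note" class. -->
  <xsl:template match="note">
    <div class="note">
      <!-- The header is supplied by the transform, not by the content. -->
      <h4>Note</h4>
      <p><xsl:apply-templates/></p>
    </div>
  </xsl:template>

</xsl:stylesheet>
```

A different template for the same element could produce a completely different rendering for another device without touching the source document.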

Building the DTD

The manual we will be converting into XML is for a simple videocassette player (VCP). The scanned version of the original paper manual is available in its entirety on the CD. Converting a basically unstructured document into a structured one is more art than science, but there are some basic steps that make the process smoother.

Note

Before you invest a lot of time in building a custom DTD, it is well worth the effort to shop around on the Internet for an existing DTD that will meet your needs. Several sites on the Web attempt to catalog and collect XML DTDs, so be sure to check the resource links listed on this book's Web site. Even if an off-the-shelf DTD doesn't exactly meet your needs, it may help get you started in the right direction.


The final goal of this step is to develop an XML document type definition (DTD) that expresses the structure of valid product manuals. Although it is tempting to go directly from the printed manual to the DTD, in reality it is much easier to convert a paper document into an XML document, then build a DTD for the resulting XML. There are automated tools that simplify this process by building a DTD that matches a well-formed XML document, and we will be using one of them after we have the entire document converted.

The actual process of encoding the printed document in XML format is straightforward, though tedious. There are a few things to keep in mind while encoding the document that can reduce problems later, as described in the following sections.

Isolate the Smallest, Indivisible Elements of a Document

Reduce the document into its atomic elements (images, paragraphs, titles, boxed notes, and so on). After these elements have been isolated, it becomes easier to classify them and group them within the XML document.

Categorize Each Document Element

Looking at each atomic piece of a document, determine whether it serves the same purpose as any of the other pieces. If not, it is necessary to create a new XML element type to contain it. Otherwise, encode it using an existing element type.

Group Related Elements

Frequently in an unstructured document, related material will be scattered throughout a section. Whenever possible, related elements should be grouped into a larger, containing element in XML. For instance, footnotes might appear throughout a technical paper at the bottom of each page. In XML, it would make more sense to place all footnotes in a single container element and include an XLink reference to the specific footnote within the structured document text. This yields a cleaner document format and simplifies transformation scripts.
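
The grouping described above might look something like the following sketch. The element names (paper, fnref, footnotes, footnote), the IDs, and the footnote text are all hypothetical; only the XLink namespace and attributes come from the XLink specification.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical element names, chosen only to illustrate the idea. -->
<paper xmlns:xlink="http://www.w3.org/1999/xlink">
  <p>XML separates structure from presentation<fnref
      xlink:type="simple" xlink:href="#FN1"/>, which simplifies
      later transformation<fnref xlink:type="simple" xlink:href="#FN2"/>.</p>

  <!-- All footnotes live in one container instead of being
       scattered across the bottoms of pages. -->
  <footnotes>
    <footnote id="FN1">Placeholder text for the first footnote.</footnote>
    <footnote id="FN2">Placeholder text for the second footnote.</footnote>
  </footnotes>
</paper>
```

A transformation script can then decide where footnotes are rendered (page bottom, end of section, or pop-up) without any change to the content.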

Use Standard Structured Document Idioms Whenever Possible

Some types of content are so common that the markup used to represent them has become idiomatic. For instance, general-purpose lists and tables are almost always encoded using the basic HTML <ol>, <ul>, and <table> tags. Not only will you save yourself the trouble of re-inventing the wheel, but you also will simplify the task of transforming your XML into HTML.
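
To illustrate why the HTML idioms simplify transformation, here is a sketch of a single XSLT template that passes such elements straight through to the output. It assumes the source document uses the HTML element names without a namespace.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Elements already encoded with HTML-style list and table markup
       are copied to the output, attributes included. -->
  <xsl:template match="ol | ul | li | table | tr | th | td">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <!-- Keep processing the children so nested markup is handled. -->
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```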

Collapse Duplicated Material

If the same boilerplate material appears in more than one location in your document, you should consider collapsing all the occurrences into a single element. This action will make your document more maintainable, smaller, and more coherent. If you need to incorporate the same material in multiple locations, use XLink to create pointers to the single valid instance of the data.

Exclude Presentational Elements

Drawing the line between true content and presentation can be tricky at times. Here are some examples of presentational elements that should not be included in a content document:

  • List or outline numbers. XSLT provides robust facilities for generating these automatically. Rather than include them in the content and then worry about keeping them up to date, remove them and let the order of the elements in the document maintain relative positioning.

  • Repeated labels or constant text. For example, a long list of addresses in a print document might include labels such as City:, State:, and Zip: for each entry. The correct way to represent this in XML is to create an <address> element that includes these values, and then provide the static text labels as part of the transformation step.

  • Hard-coded table of contents or index. Rather than build these manually and keep updating them as the document changes, use XSLT to generate them dynamically. Not only will this approach be easier, but it will make the final document more accurate.
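
The first two points can be sketched in a few lines of XSLT. The child element names (city, state, zip) and the sample values in the comment are assumptions for illustration; the template generates both the entry number and the static labels at transformation time.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumed source markup, with no labels and no list numbers:
       <address><city>Springfield</city><state>IL</state><zip>62701</zip></address>
-->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="address">
    <p>
      <!-- The entry number is computed from document order. -->
      <xsl:number count="address" format="1. "/>
      <!-- Static labels are supplied here, not stored in the content. -->
      <xsl:text>City: </xsl:text><xsl:value-of select="city"/>
      <xsl:text>, State: </xsl:text><xsl:value-of select="state"/>
      <xsl:text>, Zip: </xsl:text><xsl:value-of select="zip"/>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

Changing a label, or the numbering format, then requires an edit in one place rather than in every entry.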

Structural Versus Content Elements

The types of tags we will require in our document fall into two basic classes: structural and content (text). Structural elements exist to organize the document into logical units. Content elements contain the actual character data of the document. The difference between these two types of elements becomes more obvious as we transform the raw XML document into human-readable form for both the Web and print media. Table 18.1 lists a few structural and content elements for comparison.

Table 18.1. Structural and Text Element Examples
Structural Elements    Content Elements
body                   p
section                title
table                  img

As a rule of thumb, structural elements are almost never leaf nodes and text elements usually are. Structural elements primarily exist for the purpose of containing and organizing other elements, whereas text elements exist to contain character data.

The completed document is called VCPManual.xml. After the original printed manual has been completely encoded in XML, the next step is to create a document type definition for it. The DTD is important because it defines what is and what is not a valid product manual (at least according to our system). By formalizing the document structure, we can build generic transforms that should work not only with the manual currently in production, but also with any future manuals we create. With a DTD in place, we can make assumptions about the order of elements within the document, which elements can be nested, and where character data can appear.
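
To make the distinction concrete, a DTD fragment along the following lines would express those three kinds of rule. The content models are hypothetical, they omit the table element for brevity, and the img src attribute is invented for the example; the actual product_manual.dtd may well differ.

```xml
<!-- Hypothetical content models; the real product_manual.dtd may differ. -->

<!-- Structural elements: contain other elements, in a fixed order. -->
<!ELEMENT body        (section+)>
<!ELEMENT section     (title, (p | img | sub_section)*)>
<!ELEMENT sub_section (title, (p | img)*)>
<!ATTLIST section id ID #REQUIRED>

<!-- Content elements: leaf nodes holding character data (or nothing). -->
<!ELEMENT title (#PCDATA)>
<!ELEMENT p     (#PCDATA)>
<!ELEMENT img   EMPTY>
<!ATTLIST img src CDATA #REQUIRED>
```

Here the model for section fixes the order (a title always comes first), the nesting (a sub_section may appear only inside a section), and the places where character data is allowed (only in title and p).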

To build our DTD, we could manually review the entire document we just created, deduce the relationships between the various elements, and write the entire DTD by hand. This approach would yield a correct DTD, but it is time-consuming. Several tools are available on the Internet for generating a DTD from a well-formed XML document that does not yet have one. To build the example, I used the DTDGenerator tool that is available with the SAXON XSLT processor. This tool analyzes the structure of a document and creates a DTD that can be used to validate that document. Its usage is:

java DTDGenerator inputdocument.xml > output.dtd
						

The DTD generated by a tool such as this will not necessarily be completely correct. Because it must infer relationships between elements from a single document, the resulting DTD may be more restrictive than it should be. The automatically generated DTD should be treated as a starting point, and should be reviewed and modified to express the actual rules of the document type in question. A complete and correct DTD can be used by WYSIWYG XML editors (such as SoftQuad's XMetaL product) to create new XML documents with a simple drag-and-drop interface. One of the requirements for this project is to use as much off-the-shelf technology as possible, so building a complete and accurate DTD is essential.

Transforming for the Web

There are several approaches for displaying XML data on the Web. Before selecting a particular solution, you should take a few factors into account.

Who Will View the Content?

One of the major problems with developing Web content and applications today is the incredible diversity of consumers of Web content. Millions of users, hundreds of languages, several browsers, and variable bandwidth make delivering useful, high-quality content to every user a difficult problem. In this case, we can't make any assumptions about our audience. Mass market consumer electronics could be purchased by anyone, so we need to make sure that we don't arbitrarily exclude anyone from viewing our site.

Is the Content Static or Dynamic?

The line between static and dynamic content has become blurry as the Web has matured. Most major commercial Web sites are heavily dependent on active scripting (ASP, JSP, Perl, PHP, and so on). From a site-design perspective, static content is content that will be manually updated by a human being. For example, a page with real-time stock quotes would probably not be a static page. A biography of William Shakespeare, on the other hand, would probably be a good candidate for manual updating. Because our product manual describes an actual physical thing that is not going to change (upgrades or new models would require a new manual), a solution that produces static output will be sufficient.

What Is the Target Server Architecture?

Several different Web server platforms are in use on the Internet, but at the time of this writing Apache Server and Microsoft's IIS together control 80% of the market (according to the Netcraft survey, available at www.netcraft.com/survey). Both of these platforms provide integrated support for transforming XML using XSLT on the fly.

Based on these criteria, the content to be displayed should be

  • Viewable by as many different browsers as possible.

  • Completely static (updated only if errors are discovered).

  • Hostable on either Apache or IIS servers.

In this case, generating static plain-vanilla HTML files is a perfectly acceptable solution. The resulting files can be served and indexed efficiently, and can be viewed by most Web clients. The top-down process for generating these files is simple:

1. Define the number, type, and layout of the pages to be generated.

2. Write the XSLT script or scripts to generate the output pages.

3. Use an XSLT transformation tool to create the HTML output.

Although it would be simple to generate a single, monolithic HTML version of the manual, this would place an undue burden on users who have slower connections or limited display capabilities. A better solution is to create a short table of contents page with hyperlinks to separate pages for each section of the manual. The resulting site layout would resemble what's shown in Figure 18.2.

Figure 18.2. The desired site map of the HTML pages to be generated from VCPManual.xml.


Now the only question is this: What is the best way to generate these pages from the master document? Listing 18.1 shows a condensed version of VCPManual.xml, showing only the top-level elements.

Listing 18.1. A Condensed Version of VCPManual.xml
<?xml version="1.0" encoding="utf-8"?>
<!--

  VCPManual.xml

  Example for unified product documentation from Sams
  _Strategic XML_.

-->
<!DOCTYPE manual SYSTEM "product_manual.dtd" [
<!ENTITY model_num "GVP-C125">
]>
<manual>
  <product_info>
    <manufacturer>
      <company_name>GoldStar</company_name>
      <web_site>http://www.lgusa.com</web_site>
    </manufacturer>
    <product_type>Video Cassette Player</product_type>
    <model_num>&model_num;</model_num>
  </product_info>
  ...
  <body>
    <section id="SEC1">
      <title>Cautionary Notes</title>
      . . .
    </section>
    <section id="SEC2">
      <title>Features</title>
      . . .
    </section>
    <section id="SEC3">
      <title>Accessories</title>
      . . .
    </section>
    <section id="SEC4">
      <title>Important Safeguards</title>
      . . .
    </section>
    <section id="SEC5">
      <title>Parts and Controls</title>
      . . .
    </section>
    <section id="SEC6">
      <title>Power Sources</title>
      . . .
      <sub_section>
        <title>When using the AC adaptor (Supplied)</title>
        . . .
      </sub_section>
      <sub_section>
        <title>When using the CAR battery</title>
        . . .
      </sub_section>
    </section>
    <section id="SEC7">
      <title>Making the Right Connections</title>
      . . .
    </section>
    <section id="SEC8">
      <title>Loading and Unloading</title>
      . . .
    </section>
    <section id="SEC9">
      <title>Precautions</title>
      . . .
    </section>
    <section id="SEC10">
      <title>Pre-recorded Tape Playback</title>
      . . .
      <sub_section>
        <title>SPECIAL EFFECTS PLAYBACK
                (BEST RESULTS AT SP &amp; EP SPEED)</title>
        . . .
      </sub_section>
    </section>
    <section id="SEC11">
      <title>Duplicating a Video Tape</title>
      . . .
    </section>
    <section id="SEC12">
      <title>Video Head Cleaning</title>
      . . .
    </section>
    <section id="SEC13">
      <title>Before Requesting Service</title>
      . . .
      <sub_section>
        <title>REPLACING THE FUSE</title>
        . . .
      </sub_section>
    </section>
    <section id="SEC14">
      <title>Specifications</title>
      . . .
    </section>
    <section id="SEC15">
      <title>Warranty</title>
      . . .
    </section>
  </body>
</manual>
							

Generating the Table of Contents Page

Our first task is to write the XSLT stylesheet that will generate the table of contents page (index.html). This page should list the title of each section of the document and offer a hyperlink to the related section detail page. Listing 18.2 shows the XSLT template that will produce the HTML table of contents page.

Listing 18.2. Table of Contents Page Template
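<!-- Builds index.html: the manual title plus a linked list of sections.
     Assumes the rdf: and dc: prefixes are declared on <xsl:stylesheet>. -->
<xsl:template match="manual">
  <html>
    <head>
      <title>
        <xsl:value-of select="rdf:RDF/rdf:Description/dc:title"/>
      </title>
      <!-- The style sheet link belongs inside the head element. -->
      <link REL="stylesheet" HREF="VCRManual.css" TYPE="text/css"/>
    </head>
    <body>
      <h1><xsl:value-of select="rdf:RDF/rdf:Description/dc:title"/></h1>
      <h2>Table of Contents</h2>
      <ul>
        <!-- One list item per section, linking to that section's page. -->
        <xsl:for-each select="//section">
          <li>
            <a href="section{@id}.html"><xsl:value-of select="title"/></a>
          </li>
        </xsl:for-each>
      </ul>
    </body>
  </html>
</xsl:template>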
							

This template loops through each <section> element in the source document and emits an HTML unordered list item containing the title of the section. The title is a hyperlink that points to the actual section detail page, which we will be constructing shortly. Notice that the relative URL of the section detail page is constructed by appending the element ID of the section to the word “section” and then appending the extension .html, so the section with the ID SEC1 is linked as sectionSEC1.html.
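
A template like this cannot run on its own: it has to sit inside a stylesheet that declares the rdf: and dc: prefixes used in its XPath expressions. The following wrapper is a minimal sketch. The namespace URIs shown are the standard RDF and Dublin Core 1.1 ones, which is an assumption; they must match whatever URIs VCPManual.xml actually declares.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="rdf dc">

  <!-- Emit HTML rather than XML syntax. -->
  <xsl:output method="html" indent="yes"/>

  <!-- The table of contents template (match="manual") goes here. -->

</xsl:stylesheet>
```

The exclude-result-prefixes attribute keeps the rdf and dc namespace declarations from being copied onto the generated <html> element.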

Emitting the Section Pages

Although our document has 15 distinct sections, they all follow the same basic pattern. Also, we are trying to build a generic document processing system, and future manuals will most likely have a different number of sections. The XSLT transform for emitting a single section as an HTML page is trivial:

<!-- Emits one complete HTML page for a section. -->
<xsl:template match="section">
  <html>
    <head>
      <title><xsl:value-of select="title"/></title>
      <!-- The style sheet link belongs inside the head element. -->
      <link REL="stylesheet" HREF="VCRManual.css" TYPE="text/css"/>
    </head>
    <body>
      <h1><xsl:value-of select="title"/></h1>
      <p><a href="index.html">Table Of Contents</a></p>
      <xsl:apply-templates/>
    </body>
  </html>
</xsl:template>

However, if this template is executed repeatedly within the same XSLT script, the output will be one long file with several complete HTML pages contained in it. Unfortunately, XSLT 1.0 doesn't have a facility for directing output from a single script into multiple output files. If we were to limit ourselves to basic 1.0 functionality, we would be faced with the unappetizing choice of one of the following:

  • Creating a separate XSLT script for each section (SEC1.xslt, SEC2.xslt, and so on)

  • Resorting to shell scripting and transformation games to automatically generate the required scripts

Luckily, the XSLT 1.1 working draft includes a new element that solves our problem: the <xsl:document> element. The SAXON version 6.1 stylesheet processor supports this element, and it can be used to create multiple HTML documents from a single XSLT script. The new transformation for generating a section document would be this:

<!-- Writes each section to its own file, named from the section ID. -->
<xsl:template match="section">
  <xsl:document href="section{@id}.html">
    <html>
      <head>
        <title><xsl:value-of select="title"/></title>
        <!-- The style sheet link belongs inside the head element. -->
        <link REL="stylesheet" HREF="VCRManual.css" TYPE="text/css"/>
      </head>
      <body>
        <h1><xsl:value-of select="title"/></h1>
        <p><a href="index.html">Table Of Contents</a></p>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:document>
</xsl:template>

Notice that the href attribute of the <xsl:document> element provides the URL for the output filename. The filename is calculated dynamically by using the attribute value template syntax of XSLT ({@id}) to insert the ID of the current section.

There is also a link to a Cascading Style Sheet that will be included in the resulting HTML file. This provides basic formatting information such as the desired font, foreground and background colors, and other style elements of the page as it should be displayed by the browser.

Now that we have solved the problem of generating multiple output files from a single XSLT stylesheet, the rest of the process is a straightforward transformation of XML elements to appropriate HTML output elements. For a more detailed explanation of how this is done, see Chapter 12.
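
As a rough sketch of what those remaining templates might look like, the following stylesheet handles three element types. The sub_section and title elements appear in Listing 18.1 and p appears in Table 18.1; the HTML chosen for each (a <div> with the class "sub_section", an <h2>, and a <p>) is an assumption for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The section template already prints the title as <h1>, so
       suppress it when the section's children are processed. -->
  <xsl:template match="section/title"/>

  <!-- A sub_section becomes a block with its title as <h2>. -->
  <xsl:template match="sub_section">
    <div class="sub_section">
      <h2><xsl:value-of select="title"/></h2>
      <!-- Process everything except the title, which was just emitted. -->
      <xsl:apply-templates select="*[not(self::title)]"/>
    </div>
  </xsl:template>

  <!-- Paragraphs map directly onto HTML paragraphs. -->
  <xsl:template match="p">
    <p><xsl:apply-templates/></p>
  </xsl:template>

</xsl:stylesheet>
```

Without the empty template for section/title, the built-in rules would copy the title text into the page body a second time, immediately after the <h1> heading.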

Transforming for Print Media

Although the entire reason for the existence of XML is to allow programmers and authors to separate content from presentation information, at some point XML data will be presented to a human reader. The XSL specification addresses the need to transform XML into a human-readable format through the XSL Formatting Objects mechanism. XSL-FO defines a vocabulary of XML elements that define how to lay out text on a printed page.

Unlike Cascading Style Sheets, XSL-FO is a complete XML application in and of itself. Although it would be possible to write XSL-FO documents directly, in normal usage an existing XML document is transformed using XSLT into an XSL-FO document. Because no tools (at the time of this writing) natively display XSL-FO, another tool is required to parse the XSL-FO document and render it in a better supported format such as PDF, TeX, or PostScript. The XSL-FO renderer used to develop this application is the Apache XML Project's FOP processor.

The two-stage nature of the rendering process makes solving problems with the output document difficult. There can be errors in the XSLT transform itself, or in the resulting XSL-FO script. During the development process, it is often useful to take an initial rough cut at the XSLT script, then fix a few errors in the XSL-FO script directly. After the fixes in the XSL-FO script have been verified, the XSLT script can be updated.

Generating a Basic XSL-FO Document

Listing 18.3 shows an XSLT transform that will generate an extremely basic, but functional, XSL-FO document from any XML document. This is the starting point for applying more sophisticated styles to improve the quality of the output document.

Listing 18.3. A Basic XSL-FO XSLT Transform
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.1" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <xsl:template match="*">
    <fo:block><xsl:apply-templates/></fo:block>
  </xsl:template>

  <xsl:template match="/">
    <fo:root>
      <fo:layout-master-set>
        <fo:simple-page-master margin-right="1in" margin-left="1in"
            margin-top="1in" margin-bottom="1in"
            page-width="8.5in" page-height="11in"
            master-name="normal">
          <fo:region-body/>
        </fo:simple-page-master>
      </fo:layout-master-set>

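      <!-- The XSL 1.0 Recommendation names this attribute master-reference;
           earlier drafts of the specification used master-name. -->
      <fo:page-sequence master-reference="normal">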
        <fo:flow flow-name="xsl-region-body">
          <xsl:apply-templates/>
        </fo:flow>
      </fo:page-sequence>
    </fo:root>
  </xsl:template>
</xsl:stylesheet>
							

This script will generate an XSL-FO document that displays all the character data from each XML element in its own distinct XSL-FO block.
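
To see what this means in practice, consider a tiny input document, assumed here to contain no whitespace between its tags. The flow portion of the result would have the shape shown below; the indentation, the comments, and the placement of the namespace declaration on <fo:flow> have been added for readability.

```xml
<!-- Assumed input:
     <manual><body><section><title>Features</title></section></body></manual> -->
<fo:flow flow-name="xsl-region-body"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:block>              <!-- manual -->
    <fo:block>            <!-- body -->
      <fo:block>          <!-- section -->
        <fo:block>Features</fo:block>   <!-- title -->
      </fo:block>
    </fo:block>
  </fo:block>
</fo:flow>
```

Every element, structural or not, becomes a block, so the nesting of the source document is preserved but no element is yet distinguished from any other.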

Completing the Print Layout

The XSL-FO specification is very complex, and at the time of this writing it has still not been approved by the W3C. It would require another book as long as this one to adequately explain the full XSL-FO specification. The XSL-FO transformation provided for this manual can be used as a starting point for other projects. The best way to learn how the transformation works is to visit the book's Web site and download the complete example. Modifying the sample files and viewing the resulting document is a good way to rapidly learn XSL-FO.
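
As one example of the kind of modification worth experimenting with, the sketch below adds two more specific templates to a basic transform. The font sizes and spacing values are arbitrary choices for illustration, not the settings used in the book's complete example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <!-- Section titles: large, bold, with space below. -->
  <xsl:template match="section/title">
    <fo:block font-size="18pt" font-weight="bold" space-after="12pt">
      <xsl:apply-templates/>
    </fo:block>
  </xsl:template>

  <!-- Body paragraphs: regular text with a little space after each. -->
  <xsl:template match="p">
    <fo:block font-size="10pt" space-after="6pt">
      <xsl:apply-templates/>
    </fo:block>
  </xsl:template>

</xsl:stylesheet>
```

Because these match patterns are more specific than match="*", XSLT's default priority rules select them over a catch-all template for the elements they name, so they can be dropped into a stylesheet such as Listing 18.3 without removing anything.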
