CyberCinema: The Rosetta Stone Meets the Web

Let's consider our previous example of the CyberCinema Web site. The considerations here are a bit more complex, but the same basic process applies: Start by roughing out some XML instances and then move on to DTD design. The complexity is introduced when we consider our user population and how users are going to interact with the XML instances in the system.

Let's start by listing a few assumptions, derived from our requirements document (see Chapter 3) and subsequent data model (see Chapter 4):

  1. For CyberCinema, each review is represented by a single XML instance.

  2. Each review is stored in a database.

  3. In this database, we also track movies, review authors, actors, and directors.

  4. Review authors can refer to specific movies, actors, and directors in the text of their articles, and those references can be hyperlinks (for example, to a filmography of a particular actor or director).

  5. The primary delivery platform for CyberCinema movie reviews is the Web, although the reviews also are syndicated to other sites, to information retrieval services, and potentially to print media.

The first three requirements are fairly straightforward, but with requirements 4 and 5, things start to get hairy. Splitting the problem into two domains makes sense at this point. The two domains are recordlike data and narrativelike data. XML is good for both domains, but understanding the difference is extremely important to effective DTD design.

Recordlike data is anything that sounds like it would feel at home in a traditional relational schema, for instance, author information, a headline, or the title of the movie the review is about.

Narrativelike data, generally speaking, is anything where order is meaningful. For example, if you rearranged the words in this sentence, as follows, the meaning of the sentence is lost:

Sentence, if for the in, as example, rearranged the meaning follows you of the is sentence words lost this.

However, as noted previously, if you rearrange the declarations in our e-mail DTD, they still form a valid DTD. The DTD is recordlike, whereas the sentence is narrativelike. The term narrativelike may be a bit confusing because we're still talking about all kinds of data (words, images, links, formatting, and so on), not just narrative data in its strictest sense.

So that your documents make sense, I suggest you separate them into a head section and a body section. The head section should contain recordlike information, and the body section should contain narrativelike information. This is not to say that there is no gray area between recordlike and narrativelike information. The rule of thumb is that the main part of a document that usually renders as one block of text is the body; anything else goes in the head.

Given that we're separating our movie reviews into head and body information, a skeleton structure for an instance of a review looks like this:

<?xml version="1.0"?>
<CYBERCINEMA_REVIEW ID="123">
      <HEAD>
            <!--  Header information  goes here  -->
      </HEAD>
      <BODY>
            <!--  Body information goes here  -->
      </BODY>
</CYBERCINEMA_REVIEW>

A Note on White Space

All XML instances presented here are formatted for readability, so that white space (carriage returns, tabs, or spaces) appears between tags. In XML, white space is meaningful, which can be tricky. What this means is that if you have a space between the end of one tag and the start of another, an XML parser will view that space as a meaningful part of your document. If you want to be able to have white space in your document, you need to account for it in the content model of your elements. I think it's better to ensure that your XML instances don't contain any white space between tags, except where it is meaningful to the document itself. Depending on what authoring environment your user population employs, white space may be taken care of for you. In the systems I've built, special routines have always been needed for dealing with white space, or the lack thereof, in XML instances on their way in and out of the system. Your XML editing environment should take care of presenting the XML instances in a friendly way (not all on one line), so you really shouldn't need to have extra white space characters in the instances themselves.
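To see what a parser actually reports, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The sample strings and the helper names leading_text and strip_ignorable are illustrative assumptions, not part of the CyberCinema system; the second helper is one possible shape for the kind of "special routine" mentioned above.

```python
import xml.etree.ElementTree as ET

# The same structure, once formatted for readability and once with no
# white space between tags.
PRETTY = "<REVIEW>\n      <HEAD/>\n      <BODY/>\n</REVIEW>"
COMPACT = "<REVIEW><HEAD/><BODY/></REVIEW>"

def leading_text(xml_text):
    """Return the character data between the root's start tag and its first child."""
    return ET.fromstring(xml_text).text

def strip_ignorable(elem):
    """Drop text that consists only of white space (hypothetical cleanup routine).

    Caution: in mixed content, a lone space between two inline elements is
    meaningful, so a real routine should skip narrativelike elements.
    """
    for node in elem.iter():
        if node.text is not None and not node.text.strip():
            node.text = None
        if node.tail is not None and not node.tail.strip():
            node.tail = None
    return elem

print(repr(leading_text(PRETTY)))   # the line break and indentation come back as text
print(repr(leading_text(COMPACT)))  # no character data between the tags
cleaned = strip_ignorable(ET.fromstring(PRETTY))
```

A routine like this is safe for the recordlike parts of an instance; whether it is safe for the body depends on how the body's content model treats white space.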


The Head

We've defined the head of our XML instance as containing recordlike information. Looking back at our requirements for CyberCinema, we see that the following recordlike pieces of information are associated with each review:

  • Author

  • Headline

  • Summary or abstract of the review

  • Date the review was “published” (that is, the date the review was released, not necessarily the date it was created or last modified, so you may also want to track the create date and the last modified date)

  • Movie being reviewed

Dates

Throughout this book, when I refer to dates or timestamps, I'm talking about “date and time,” as in an exact measurement of the date and time. We'll get into how exact later (see the sidebar A Brief History of Time later in this chapter).


Remember, information about actors and directors and about the graphics embedded in reviews belongs in the body, not in the head, of the review. The body is perfectly capable of storing structured information; it isn't an unstructured blob that is included for display purposes only.

Singular Versus Plural: Putting Together Blocks

One important question to ask yourself about each piece of information you put in the head of the instance is: Is it plural or singular? We've stated that reviews can have more than one author, so author information is plural. A review shouldn't have multiple headlines or abstracts, so the headline and abstract are singular. An article can't be created twice or published for the first time twice, so these events are singular. They happen only once, so the date stamps for them are also singular. Although an article can be modified more than once, in this example we're tracking only the time the article was last modified.

Organizing plural elements within blocks is a convenient way to group them together and set them apart from the other elements within the head, like organizing files in file folders. Let's continue with our example of movie reviews. The head consists of a single “author block,” which contains all author information, and then a set of singular items: REVIEWED, HEADLINE, ABSTRACT, CREATE_DATE, LASTMOD_DATE, and PUBLISH_DATE. By using a single author block to store all of the author information, you also make it easier to extract this information from the XML later on because extracting it requires only one operation: finding and retrieving the author block. The alternative would be to find and retrieve each author entry separately.
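The "one operation" idea can be sketched in Python with xml.etree.ElementTree. This is only an illustration: the second author, the sample head, and the function name authors are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# A cut-down head with two authors; the second author is invented for the example.
HEAD_XML = """<HEAD>
  <AUTHOR_BLOCK>
    <AUTHOR ID="123">Daniel Appelquist</AUTHOR>
    <AUTHOR ID="456">A. Second Author</AUTHOR>
  </AUTHOR_BLOCK>
  <HEADLINE>Classic Film Still Fresh</HEADLINE>
</HEAD>"""

def authors(head):
    """Find the author block once, then read every author entry inside it."""
    block = head.find("AUTHOR_BLOCK")
    if block is None:
        return []
    return [(author.get("ID"), author.text) for author in block.findall("AUTHOR")]

head = ET.fromstring(HEAD_XML)
print(authors(head))
```

Because all the plural items live under one parent, the caller never has to know in advance how many authors a review has.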

The head of our document is starting to take shape. Building on the skeleton we constructed previously, we have a good idea of what the head looks like:

<?xml version="1.0"?>
<CYBERCINEMA_REVIEW ID="123">
      <HEAD>
            <!--  Header information  goes here  -->
            <AUTHOR_BLOCK>
                  <AUTHOR ID="123">Daniel Appelquist</AUTHOR>
            </AUTHOR_BLOCK>
            <REVIEWED ID="3827">Gone With the Wind</REVIEWED>
            <HEADLINE>Classic Film Still Fresh</HEADLINE>
            <ABSTRACT>This film is often thought of as the best example of
            classic...</ABSTRACT>
            <CREATE_DATE DATE="2000-05-17T17:10:00,0"/>
            <LASTMOD_DATE DATE="2000-05-17T18:12:00,0"/>
            <PUBLISH_DATE DATE="2000-05-17T19:27:00,0"/>
      </HEAD>
      <BODY>
            <!--  Body information goes here  -->
      </BODY>
</CYBERCINEMA_REVIEW>

You'll notice that the date stamp elements in the HEAD element look a little funny: their tags end with a forward slash, as in <PUBLISH_DATE DATE=""/>. This forward slash at the end of a tag is the notation for an empty element. It is one of the main differences between XML and the markup languages that have come before it (such as HTML). It's a shorthand way of using an opening tag <PUBLISH_DATE DATE=""> and a closing tag </PUBLISH_DATE> next to each other. In fact, these two bits of code (the single XML date tag that ends with the forward slash and the pair of opening and closing tags) are functionally identical; <PUBLISH_DATE DATE=""/> is just a more convenient and cleaner way to represent elements that don't have contents (such as these date stamps, which have only attributes).
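One way to convince yourself that the two notations are equivalent is to parse both and compare what the parser hands back. The sketch below uses Python's xml.etree.ElementTree; the helper name summary is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

SHORT = '<PUBLISH_DATE DATE="2000-05-17T19:27:00,0"/>'
LONG = '<PUBLISH_DATE DATE="2000-05-17T19:27:00,0"></PUBLISH_DATE>'

def summary(xml_text):
    """Reduce an element to its name, attributes, text, and number of children."""
    elem = ET.fromstring(xml_text)
    return (elem.tag, dict(elem.attrib), elem.text, len(elem))

# Both notations produce the same parsed element.
print(summary(SHORT) == summary(LONG))
```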

A Brief History of Time

As in the previous example, when a date and/or time must be specified in an element (as in the <PUBLISH_DATE> element), a good way to do it is to use the following format:

CCYY-MM-DDThh:mm:ss,s

Each of the letters in the preceding example is a placeholder for a digit. CCYY indicates century and then year, MM is the month, and DD is the day of the month. T is a literal character, hh is the hour of the day (in 24-hour time), mm is the minutes of the hour, and ss is the seconds in the minute. The final s represents tenths of a second, which is useful when you really need to be precise. This format is the Extended Format of calendar date and local time of day as described by the International Organization for Standardization (in ISO 8601, section 5.4.1, clause a). Because the ISO has provided a standard way to represent date/time in your XML instances, don't reinvent the wheel. Use this format and convert dates in and out of it for all the other date formats you need (for instance, for display purposes in another format such as May 17, 1997 or 1997-05-17 or to insert a date into a date field in a database).

For example, the formatted string for May 17, 1997, at 17:00 is

1997-05-17T17:00:00,0
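Converting in and out of this format takes only a few lines in most languages. Here is a minimal sketch in Python, assuming exactly one digit of tenths after the comma as in the pattern above; the function names to_iso_extended and from_iso_extended are made up for the example.

```python
from datetime import datetime

def to_iso_extended(dt):
    """Format a datetime as CCYY-MM-DDThh:mm:ss,s (tenths of a second after the comma)."""
    tenths = dt.microsecond // 100_000
    return f"{dt:%Y-%m-%dT%H:%M:%S},{tenths}"

def from_iso_extended(text):
    """Parse CCYY-MM-DDThh:mm:ss,s back into a datetime."""
    main, _, tenths = text.partition(",")
    parsed = datetime.strptime(main, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=int(tenths or 0) * 100_000)

stamp = to_iso_extended(datetime(1997, 5, 17, 17, 0, 0))
print(stamp)
print(from_iso_extended("2000-05-17T19:27:00,0"))
```

Keeping the two conversions side by side makes it easy to check that a value survives a round trip unchanged.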

To access the complete ISO standard in PDF form (you will be amazed at how much can be written on the topic of “What time is it?”), check http://www.iso.ch/markete/8601.pdf.

To complicate things a bit more, whenever you give something a timestamp, you have to consider the issue of time zones. This issue is especially important for systems that operate over the Internet because your XML instances frequently will cross time zones. One approach, which I favor, is to represent all timestamps in GMT (Greenwich mean time). Make sure that every time you store a timestamp, you translate it from the local time zone to GMT, and every time you display one, you translate it from GMT to the viewer's time zone. (You also have to make allowances for daylight saving time, which requires care because daylight saving time changes at different times depending on where on earth you are.) All other time zones are annotated in relation to GMT (for example, U.S. eastern standard time is GMT-5), so representing timestamps in GMT is quite natural and not at all a throwback to a bygone era of the British empire.
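As a rough illustration of the GMT approach, the Python sketch below stores a timestamp in GMT and converts it back for display. It assumes a fixed GMT-5 offset for U.S. eastern standard time and deliberately ignores daylight saving time; the names EST, to_gmt, and for_display are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zone for U.S. eastern standard time (GMT-5); no daylight saving rules.
EST = timezone(timedelta(hours=-5), "EST")

def to_gmt(local_dt):
    """Convert a zone-aware local timestamp to GMT before storing it."""
    if local_dt.tzinfo is None:
        raise ValueError("timestamp must carry a time zone")
    return local_dt.astimezone(timezone.utc)

def for_display(gmt_dt, zone):
    """Convert a stored GMT timestamp to the viewer's zone for display."""
    return gmt_dt.astimezone(zone)

created_local = datetime(2000, 5, 17, 12, 10, 0, tzinfo=EST)
stored = to_gmt(created_local)
shown = for_display(stored, EST)
print(stored.isoformat(), shown.isoformat())
```

A production system would use a time zone database rather than a fixed offset so that daylight saving changes are handled for it.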

Of course, you could reject using GMT and use Swatch's “Internet-Time” (http://www.swatch.com/). But people might think you're on hallucinogenic drugs, so I would shy away from it if I were you.


A DTD fragment for the previous XML instance follows:

<!ELEMENT HEAD (AUTHOR_BLOCK, REVIEWED, HEADLINE, ABSTRACT, CREATE_DATE, LASTMOD_DATE,
 PUBLISH_DATE)>
<!ELEMENT REVIEWED (#PCDATA)>
<!ATTLIST REVIEWED
      ID                        NMTOKEN                      #REQUIRED
>
<!ELEMENT HEADLINE (#PCDATA)>
<!ELEMENT ABSTRACT (#PCDATA)>
<!ELEMENT CREATE_DATE EMPTY>
<!ATTLIST CREATE_DATE
      DATE                      CDATA                        #REQUIRED
>
<!ELEMENT LASTMOD_DATE EMPTY>
<!ATTLIST LASTMOD_DATE
      DATE                      CDATA                        #REQUIRED
>
<!ELEMENT PUBLISH_DATE EMPTY>
<!ATTLIST PUBLISH_DATE
      DATE                      CDATA                        #REQUIRED
>

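In this example, you'll note that the content models for some of our elements are defined as PCDATA. This stands for “parsed character data,” as opposed to character data (CDATA), which we saw in our first example in the section Building a DTD. Parsed character data is text that the parser scans for markup such as tags and entity references; CDATA, by contrast, is the DTD's type for plain string attribute values. A content model of (#PCDATA) on its own allows text only. Any “mixed content” areas of your document, where you want to combine text and tags, have a content model that lists #PCDATA together with the permitted elements, in the form (#PCDATA | ELEMENT_A | ELEMENT_B)*.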
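The parsers in Python's standard library do not validate against a DTD, so the sketch below checks the HEAD content model by hand. It is only a stand-in for a validating parser: it tests the order of the child elements and nothing else, and the names HEAD_SEQUENCE and head_matches_sequence are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The sequence declared for HEAD in the DTD fragment above.
HEAD_SEQUENCE = ["AUTHOR_BLOCK", "REVIEWED", "HEADLINE", "ABSTRACT",
                 "CREATE_DATE", "LASTMOD_DATE", "PUBLISH_DATE"]

def head_matches_sequence(head):
    """Return True if HEAD's children appear exactly once each, in the declared order."""
    return [child.tag for child in head] == HEAD_SEQUENCE

GOOD = ET.fromstring(
    "<HEAD><AUTHOR_BLOCK/><REVIEWED ID='3827'>Gone With the Wind</REVIEWED>"
    "<HEADLINE>Classic Film Still Fresh</HEADLINE><ABSTRACT>...</ABSTRACT>"
    "<CREATE_DATE DATE='2000-05-17T17:10:00,0'/>"
    "<LASTMOD_DATE DATE='2000-05-17T18:12:00,0'/>"
    "<PUBLISH_DATE DATE='2000-05-17T19:27:00,0'/></HEAD>")

# Same head with HEADLINE and ABSTRACT swapped, which the content model forbids.
BAD = ET.fromstring(
    "<HEAD><AUTHOR_BLOCK/><REVIEWED ID='3827'/><ABSTRACT/><HEADLINE/>"
    "<CREATE_DATE DATE='x'/><LASTMOD_DATE DATE='x'/><PUBLISH_DATE DATE='x'/></HEAD>")

print(head_matches_sequence(GOOD), head_matches_sequence(BAD))
```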

The Body

The body of the XML instance is where all the really interesting stuff happens. The head has a tightly constrained content model, whereas the body is full of mixed content, that is, parseable character data mixed with elements. In the body of the document, you first realize why using XML is a great way to represent this kind of data. XML brings structure to the unstructured world of narrativelike data.
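To make "mixed content" concrete, the Python sketch below walks one level of a body and lists the runs of text and the elements in document order. The sample sentence, the HTML-style I and B tags, and the function name pieces are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

BODY_XML = "<BODY>The film <I>Spartacus</I> is a <B>classic</B> of its kind.</BODY>"

def pieces(elem):
    """Yield (kind, value) pairs in document order for one level of mixed content."""
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield (child.tag, "".join(child.itertext()))
        # ElementTree stores the text that follows an element in its "tail".
        if child.tail:
            yield ("text", child.tail)

body = ET.fromstring(BODY_XML)
for kind, value in pieces(body):
    print(kind, repr(value))

# Dropping the tags recovers the plain narrative, in its original order.
plain = "".join(body.itertext())
```

The order of the pieces matters here in a way it never does in the head, which is exactly the recordlike versus narrativelike distinction.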

Let's examine what we want to do in the document body. The body should include the following:

  1. Normal text markup features, such as bold, italic, and underline.

  2. Headings to delineate one section of a review from another.

  3. Actor names, movie names, and director names must be links into a search facility of some kind.

  4. Links to other reviews.

  5. Links to outside URLs.

The first item designates normal markup features. For simplicity's sake we're going to borrow HTML-style markup for things like italics, bold, and headings; there's no need to reinvent the wheel.

Item 3 is actually “two, two, two items in one!” In Chapter 4, we figured out that directors and actors are both “people” (remember the person entity in Chapter 4?). Looking back at the data model diagram (in Figure 4-5), we see we're also tracking movies as entities. Thus we need some way to mark a word or phrase within the text of our review as having to do with a movie or a person. Items 4 and 5 both fit into the same category, linking a part of the instance to some external entity. Let's take a closer look at linking.

Linking Up: XLink

If you're familiar with HTML and the Web, you think of a link as an underlined phrase or image that, when clicked, triggers a Web browser to fetch another page. This relatively familiar paradigm is only one kind of link: a hypertext link. XML and XLink, the recently completed XML Linking Language specification, enable you to define many more link properties and behaviors. Any XML tag can be defined as having the properties of a link.

XML Linking Language (XLink)

The XLink specification is a redefinition of linking. If you're a Web user, you probably think of a link as “something you click to go somewhere else.” The concept of a “link” is key to the metaphor of the Web. Once you get the way links work, the user interface of the Web suddenly makes sense. It's an immensely valuable idea, but the XLink spec rigorously attempts to define and enlarge this idea. With XLink linking, links can be bidirectional, they can be managed so that they never go “stale,” and they enable you to add your own links from documents you don't own.

One immediate benefit of XLink is a syntax that enables you to represent links generically in your XML documents in a way that can be reflected easily in a link database (because each link is assigned a unique ID). For that reason and because it's a good idea to support standards (remember the Rosetta stone), it's a good idea to use the XLink syntax, but don't get too bogged down in the details.

For a full description of the XLink spec syntax, check out http://www.w3.org/TR/xlink/. Tim Bray has also written a primer on XLink at http://www.xml.com/pub/a/2000/09/xlink/.


The following DTD fragment is an example of how to implement the MOVIE element using the XLink syntax. XLink requires the definition of the following attributes: type (the kind of link it is), href (what it's linking to, the hypertext reference), show (what should happen when the link is actuated, either clicked or the equivalent action in whatever interface you're dealing with), and actuate (how the link is actuated, either by user action or by default on the page loading). It's beyond the scope of this book to go into all the permutations of these options. Suffice it to say that the kinds of links we're talking about are locator links (they locate another resource, such as a page, URL, movie title, and so on). Clicking on them replaces what the user is looking at. The user must click the link to activate it, so the value of actuate is onRequest.

<!ELEMENT MOVIE ANY>
<!-- xlink:type: this is a locator link because it points to an external
     resource -->
<!-- xlink:show: when the link is actuated (such as with a click) should the
     linked-to data come up in a new window, be embedded in the current
     window or replace the current content? -->
<!-- xlink:actuate: how should the link be activated? Default is on user
     request (for example, the user clicks on the link text) -->
<!ATTLIST MOVIE
      xlink:type         (simple|extended|locator|arc)
                                                   #FIXED "locator"
      xlink:href         CDATA                     #REQUIRED
      xlink:show         (new | embed | replace)   "replace"
      xlink:actuate      (onRequest | onLoad)      "onRequest"
>

This DTD fragment instantiates our MOVIE element as a link using the XLink syntax. Those familiar with HTML may find it strange to use the href keyword where it isn't pointing to a URL. In XLink language, an href is just a way to point to an external resource. In this case, we're pointing to a movie, a record of which is stored in a relational database table (and therefore is referenced by a numerical ID).

The preceding example might look complex, but because most of the attributes are implied or fixed, they aren't required in the actual code. A sample of our new movie element looks like this:

<MOVIE xlink:href="12345">Ben Hur</MOVIE>

That's it! If we assume that our PERSON and REVIEW elements are going to be similarly defined, we come up with the following full XML instance that uses our MOVIE, PERSON, and REVIEW elements:

<?xml version="1.0"?>
<CYBERCINEMA_REVIEW ID="123">
     <HEAD>
           <!--  Header information  goes here  -->
           <AUTHOR_BLOCK>
                <AUTHOR ID="2">Daniel Appelquist</AUTHOR>
           </AUTHOR_BLOCK>
           <REVIEWED ID="12345">Spartacus</REVIEWED>
           <HEADLINE>Roman Holiday</HEADLINE>
           <ABSTRACT>This film marks the pinnacle of historical action
           drama.</ABSTRACT>
           <CREATE_DATE DATE="2000-05-17T17:10:00,0"/>
           <LASTMOD_DATE DATE="2000-05-17T18:12:00,0"/>
           <PUBLISH_DATE DATE="2000-05-17T19:27:00,0"/>
     </HEAD>
     <BODY>
           <!--  Body information goes here  -->
           The film <MOVIE xlink:href="12345">Spartacus</MOVIE> stars <PERSON
           xlink:href="932">Tony Curtis</PERSON> as a fun-loving slave. Often
           confused with
           <MOVIE xlink:href="12346">Ben Hur</MOVIE> (see our <REVIEW
           xlink:href="876">review</REVIEW>), this 1960s classic is...
     </BODY>
</CYBERCINEMA_REVIEW>

It gets really interesting when you store these links in your relational database, which I describe in Chapter 6.
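As a preview of what "storing these links" might involve, the Python sketch below pulls every linked element out of a review body as a row that could be inserted into a link table. It assumes the xlink prefix is bound to the XLink namespace on the root element, which a namespace-aware parser requires; the row layout and the function name extract_links are assumptions made for the example, not the schema from Chapter 6.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
HREF = f"{{{XLINK_NS}}}href"  # how ElementTree spells a namespaced attribute name

REVIEW_XML = f"""<CYBERCINEMA_REVIEW ID="123" xmlns:xlink="{XLINK_NS}">
  <BODY>The film <MOVIE xlink:href="12345">Spartacus</MOVIE> stars
  <PERSON xlink:href="932">Tony Curtis</PERSON>. Often confused with
  <MOVIE xlink:href="12346">Ben Hur</MOVIE> (see our
  <REVIEW xlink:href="876">review</REVIEW>).</BODY>
</CYBERCINEMA_REVIEW>"""

def extract_links(review):
    """Return (review ID, element name, target ID) for every linked element in BODY."""
    review_id = review.get("ID")
    rows = []
    for elem in review.find("BODY").iter():
        target = elem.get(HREF)
        if target is not None:
            rows.append((review_id, elem.tag, target))
    return rows

links = extract_links(ET.fromstring(REVIEW_XML))
for row in links:
    print(row)
```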

After you've properly identified the movie titles and the names of people in the body of the review, it's up to you how you want those elements to appear in the representation of this review to the viewing public. (See the Appendix for the complete CyberCinema DTD.)

One challenge when dealing with links (XLinks and otherwise) is keeping them fresh. Using XLink enables you to keep links fresh, but your application still has to do some work.

Keeping References Current

References to other movies should enable end users to locate reviews of those movies, even if the reviews are written and/or published later. You don't want to modify every review that references Spartacus just because someone wrote a new review of it. Your application should keep these references current for you. As you add reviews, you link them to movies through their database IDs. In most cases, you shouldn't have to change already created XML just because you're adding something new.
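The idea can be sketched with Python's built-in sqlite3 module. The two-table schema, the sample rows, and the function name reviews_for_movie are assumptions made for the example; the point is only that the lookup happens at display time, so the stored XML never changes.

```python
import sqlite3

def open_db():
    """Create a throwaway in-memory database with a hypothetical two-table schema."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT)")
    db.execute("CREATE TABLE review (id INTEGER PRIMARY KEY, "
               "movie_id INTEGER REFERENCES movie(id), headline TEXT)")
    return db

def reviews_for_movie(db, movie_id):
    """Resolve a movie reference to its reviews when the page is rendered."""
    cursor = db.execute(
        "SELECT id, headline FROM review WHERE movie_id = ? ORDER BY id",
        (movie_id,))
    return cursor.fetchall()

db = open_db()
db.execute("INSERT INTO movie VALUES (12345, 'Spartacus')")
db.execute("INSERT INTO review VALUES (123, 12345, 'Roman Holiday')")
before = reviews_for_movie(db, 12345)

# A new review arrives later; no existing XML instance is touched.
db.execute("INSERT INTO review VALUES (877, 12345, 'A Second Look')")
after = reviews_for_movie(db, 12345)
print(before, after)
```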

Deleting items (such as reviews) is another matter, and it is more complicated. If you delete a review, for instance, and five other reviews are linked to this review, you'll have to remove the links in those five reviews before the database will actually enable you to remove the review you're trying to remove (thanks to referential integrity). Although you can do this, it's not very efficient. Hence, be careful about which pieces of data you “decompose,” that is, extract from the XML and store in the relational database.

You may want to consider adding an “inactive” flag to your database table where your content is stored. That way, you can delete items by turning this flag on, and you won't have to worry about traversing many documents if you want to delete one of them. In fact, I never delete anything; I simply flag certain items as inactive, enabling me to recover them later if I've made a mistake. It also enables me to retain referential integrity in my database without having to jump through hoops. I'll discuss this further in Chapter 6 when we get into relational schema design.
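Here is a minimal sketch of the flag approach using Python's sqlite3 module. The table and column names and the helper functions are assumptions made for the example; foreign key enforcement is switched on explicitly because SQLite leaves it off by default.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
db.execute("CREATE TABLE review (id INTEGER PRIMARY KEY, "
           "inactive INTEGER NOT NULL DEFAULT 0)")
db.execute("CREATE TABLE review_link ("
           "from_review INTEGER REFERENCES review(id), "
           "to_review INTEGER REFERENCES review(id))")
db.executemany("INSERT INTO review (id) VALUES (?)", [(123,), (876,)])
db.execute("INSERT INTO review_link VALUES (123, 876)")  # review 123 links to 876

def hard_delete(review_id):
    """Try a real DELETE; return False if the database refuses it."""
    try:
        db.execute("DELETE FROM review WHERE id = ?", (review_id,))
        return True
    except sqlite3.IntegrityError:
        return False

def soft_delete(review_id):
    """Turn the inactive flag on instead of removing the row."""
    db.execute("UPDATE review SET inactive = 1 WHERE id = ?", (review_id,))

def active_ids():
    """List the reviews that should still be shown to readers."""
    rows = db.execute("SELECT id FROM review WHERE inactive = 0 ORDER BY id")
    return [row[0] for row in rows]

deleted = hard_delete(876)   # refused: review 123 still links to it
soft_delete(876)
print(deleted, active_ids())
```

The row for review 876 is still there, so the link from review 123 remains valid and the review can be restored by clearing the flag.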

Dealing with Arbitrary Binary Data

Another thorny issue when dealing with content in the bodies of your XML instances is what to do with binary data. Especially in the framework of content management, you often find yourself dealing with nontext data, such as images, sounds, video, and smells.[1] There are two approaches to mixing XML and binary data:

[1] Don't laugh; a company called DigiScents (www.digiscents.com) is doing just this. Very cool stuff.

  • Point to the binary data externally (as in an external file on the file system or in a database or as in a Uniform Resource Identifier, or URI, that points to the data). This is the approach that HTML has taken. The big problem here is data synchronization, because the document and the data it points to are stored separately. For example (taken from XHTML):

    <img src="http://www.torgo.com/torgo.gif"/>
    
  • Encode the binary data in the XML document itself, using base64 encoding. The problem here is that encoding binary data whenever you want to store it and decoding it again when you want to use it isn't terribly efficient. For example:

    <image dt:dt="binary.base64">84592gv8Z53815b82bA68g</image>
    

    The preceding encodes the binary data into a string of numbers and letters, using the base64 encoding algorithm, a commonly used encoding standard for which encoders and decoders are built in to many programming environments (Java and Perl, to name two).

Deciding which is the best approach depends on your application. If you're managing content in an internal application (that is, you're not packaging content to send to other businesses), you shouldn't go through the bother of encoding binary data into the XML documents. It's better to use an external reference. However, if you're shipping your XML over the Internet or over a private network to customers and other businesses, it makes sense to “wrap up” the binary data with the XML when you send it (as with the base64 encoding example).
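The "wrap it up" option might look like the following Python sketch, which uses the standard base64 module. The plain encoding attribute, the sample bytes, and the function names embed and extract are assumptions made for the example rather than an established convention.

```python
import base64
import xml.etree.ElementTree as ET

def embed(tag, payload):
    """Wrap binary data in an element as base64 text (hypothetical 'encoding' attribute)."""
    elem = ET.Element(tag, {"encoding": "base64"})
    elem.text = base64.b64encode(payload).decode("ascii")
    return elem

def extract(elem):
    """Recover the original bytes from an element produced by embed()."""
    return base64.b64decode(elem.text)

gif_header = b"GIF89a\x01\x00\x01\x00"      # ten bytes standing in for a real image
image = embed("image", gif_header)
xml_text = ET.tostring(image, encoding="unicode")
round_trip = extract(ET.fromstring(xml_text))
print(xml_text)
```

Base64 emits four characters for every three bytes of input, so the encoded form is roughly a third larger than the original data, which is part of the efficiency cost mentioned above.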

Extensible Hypertext Markup Language

XHTML stands for eXtensible HyperText Markup Language, but it's more descriptive to call it “XML-compliant HTML.” It's HTML that has been slightly modified to make it comply with XML. For instance (if you're familiar with HTML), the HTML tag for an image is <IMG SRC="foo.gif">. That's an illegal construct in XML because in XML every open tag has to be accompanied by a close tag in order to preserve the tree structure of the data in an XML document. Equivalent XHTML for this (XHTML also requires lowercase element and attribute names) would be

<img src="foo.gif"></img>

or simply

<img src="foo.gif"/>

The latter is shorthand for the former.

Likewise, the <BR> tag (another tag that appears alone) must be represented as <br/>. The rest of XHTML is similarly defined. Another important difference between HTML and XHTML is the use of the <P> tag. In HTML, the <P> tag is often used just to insert some white space. In XHTML, the <p> tag is an enclosing tag:

<p>This is a paragraph</p>

All tags are containers; this conceptual difference is one of the most difficult concepts for HTML jockeys to understand about XHTML. Actually, <P> has been defined as a container since HTML 2.0, but HTML allows its end tag to be omitted and Web browsers have never insisted on one, so using <P> as a separator has been the de facto usage. Because XHTML (and all documents written in XML-derived languages) must be well formed, that kind of sloppiness is no longer allowed.
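Any XML parser will enforce this for you. The Python sketch below feeds HTML-style and XHTML-style markup to xml.etree.ElementTree; the sample strings and the function name is_well_formed are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# HTML habits: <p> used as a separator and <br> left unclosed.
HTML_STYLE = "<body>First paragraph<p>Second paragraph<br></body>"

# The same content with every tag closed.
XHTML_STYLE = "<body><p>First paragraph</p><p>Second paragraph<br/></p></body>"

print(is_well_formed(HTML_STYLE), is_well_formed(XHTML_STYLE))
```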


Building XML DTDs: Let the Experts Do the Hard Stuff

In providing examples of DTDs earlier in this chapter, I found it handy to pick and choose DTD fragments from elsewhere (such as from HTML or from DTD samples available in public Internet repositories) and then combine them to suit my needs. However, be warned: You should first understand how DTDs are constructed and how to define them properly. Only then should you use this technique as a means of minimizing the time it takes to define a DTD for your needs. Luckily, an entire community of XML professionals is more than willing to put their work in the public domain. The best way to become a competent DTD author is to read the DTDs (and DTD documentation) that others have produced.

I have limited the number of URLs provided in this book because they often go stale (especially when they're provided in printed documents), so the ones in this section are the URLs to the home pages for these sites. You can use these URLs as starting points in your search for XML resources and information. In particular, these sites provide DTD examples and fragments and DTDs tailored for specific needs.

  • W3C at http://www.w3.org. Go to the source: the World Wide Web Consortium maintains an archive of XML DTDs and specifications and provides supporting documentation. This is an excellent resource for anyone building DTDs and should be your first stop.

  • XML Catalog at http://www.xml.org. This is probably the best overall site for pointers to XML resources on the Web. The XML Catalog area is a particularly good place to look for industry-specific DTDs and DTD fragments. Financial reporting? Check out the Accounting section. Building documents about the next generation space shuttle? Check out the Space section. You never knew so many disciplines were using XML of one kind or another.

  • Resources at http://www.xml.com. As the name suggests, xml.com is more commercially oriented than xml.org, which can be both a good and a bad thing. You'll see more links to interesting projects being worked on in the commercial software community, but less funky leading edge stuff. Both xml.org and xml.com are good places to look, and no DTD scavenging session is complete without them.

  • xCBL (XML Common Business Library) at http://www.xcbl.org. xCBL is an excellent resource for business-oriented XML DTD fragments and other resources.
