Chapter 1

XML

1.1 Introduction

The title of this book is Querying XML, so we start by introducing XML, describing what we mean by “querying,” and then discussing the special challenges in querying XML.

XML – the Extensible Markup Language – defines a set of rules for adding markup to data. Markup adds structure to data, and gives us a way of talking about the meaning of that data. The family of XML technologies provides a way to standardize the representation of data, so that we can process any data with standard programs, share data across applications, and transfer data from one person or application to another. In this first chapter, we introduce XML by looking at what markup is and what it’s good for. Then we look at a number of different uses for XML – a number of different kinds of XML data. Finally, we give examples of other ways to represent data, and compare them with XML.

1.2 Adding Markup to Data

Let’s take the movies example (Appendix A: The Example) used throughout this book. We have data describing many of our favorite movies. The data includes the title of the movie, the year it was first released, the names of some of the cast members, and other information about the movie. In this section, we look at the data in its raw form, then discuss how that data might be marked up to make it more useful.

1.2.1 Raw Data

We could represent our movie data in raw form, as in Example 1-1.

Example 1-1   movie, Raw Data

image

Example 1-1 is the raw data for one movie – a single record. In this format, the data doesn’t tell you much about the movie. You can probably spot the title, and, if you are familiar with “An American Werewolf in London,” you may be able to glean some information by means of educated guesswork. But if you wanted to write a program to read this data and do something with it – such as finding the name of the director – you would have to write code specifically for this piece of data (e.g., code that extracts the characters at positions 41 through 44 and 35 through 40 and adds a space in between them). What we need is some way to represent the data so that a program (or person) can process any movie record in the same way.

1.2.2 Separating Fields

A simple way to add some rudimentary structure to this record is to add a comma between each of the data items, or fields.

Example 1-2   movie, Fields Separated by Commas

image

Example 1-2 is the same movie data represented as a comma-separated list. Notice that, even with this simple mechanism, we had to introduce the “” (backslash) character to “escape” a comma that was actually part of the data.

There are other ways to distinguish between fields of a record. In the early days of computing, fixed-length fields were common – each field might occupy, say, 8 bytes. This method makes access simple – if you want to access the beginning of the third field, you can go directly to the 17th byte. But fields smaller than 8 bytes take up more space than they need to, and fields longer than 8 bytes require some indication that they are spread across more than one field (such as a continuation marker).

Let’s continue our discussion with the comma-separated list in Example 1-2. You can spot the fields in this record, but there is no way of knowing which fields go together. For example, the fields “Agutter,” “Jenny,” “female,” and “Alex Price” each describe one aspect of a cast member, but it’s not apparent from the comma-separated list that those fields have anything in common. We have a way of delineating fields; now we need some way of grouping fields together.

1.2.3 Grouping Fields Together

Example 1-3 groups fields together. It also introduces a hierarchy of fields and subfields. Fields are separated by one or more commas, and fields that belong together are bounded by “,” at the start and “$,” at the end.

Example 1-3   movie, Grouped Fields

image

image

Example 1-3 is shown with some extra white space – each sub-field starts on a new line, and is indented. This is purely for (human) readability.

Now we know that “Agutter, Jenny, female, Alex Price” all belongs together and is all related in some way to “An American Werewolf in London.” And if you want to write a program to extract the director of each movie, given that each movie is formatted in the same way as in Example 1-3, you can write some general code that will parse the movie into first, second, and third fields, extract the contents of the third field, and parse that to get the first and last name of the director.

We are making progress! But Example 1-3 still has some shortcomings. There is no indication of what a field represents, other than its position within the record, which makes it difficult for humans to read. This has two implications – first, the data is vulnerable to error. If you (or the program generating the data) make a mistake and leave out the year of release, it’s not obvious that anything is missing, and a program processing this data may well return “LandisJohn” when asked for the year of release. Second, it makes it difficult to talk about the data. Most of the time, when we want to “talk about” the data, we want to describe some manipulation to a program – i.e., it’s difficult to write a program that says things like “print the second field of the third field of the movie record, then a space, then the first field of the third field of the movie record.” Our next step is to name the fields and subfields.

1.2.4 Naming Fields

If you read Example 1-3, you can probably guess that “An American Werewolf in London” is the title of the movie, and you may even deduce that Jenny Agutter plays the female lead, a character named Alex Price. But who is Peter Guber? And what does “98” mean? What we need is a way to name each field, to make it easier to talk about the fields – to write programs that manipulate them – and also to give some clue as to what the fields actually mean. We could devise a way to represent field names as part of our comma-separated list – perhaps each comma would be followed by a field name in double quotes. Fortunately, we don’t need to – we have XML.

Example 1-4   movie, Fields Grouped and Named

image

Example 1-4 is close to the XML representation of movie data that we will use for the rest of this book. The “,” and “$,” have been replaced by “<tagname>“ and “</tagname>.“ Each field in this record – in XML terms, each element in this document – has a name. We can now refer to elements by name and by their position with respect to other named elements. And when the name is something meaningful, such as “producer,” it gives a hint to the human reader about what the data means. All we need now is a map of the data – actually two maps, one to tell us what the structure of a movie record (a valid movie document) looks like, the other to tell us what each element actually means.

1.2.5 A Structural Map of the Data

One useful kind of data map tells you something about the structure, or “shape,” of the document – which fields are subfields of others and in what order they can appear in the document. Such a map is obviously useful for someone manipulating the data, since she needs to know that the director element contains a family Name and a givenName. It’s also useful for error-checking and consistency – every movie has a director, so if the director element is missing, then the data is corrupted or at best incomplete. Let’s take a look at a couple of structural data maps for XML – DTDs and XML Schemas.1

DTD – Document Type Definition

An early attempt at providing a map for XML was the DTD, or Document Type Definition (actually the DTD was inherited from SGML – see Section 1.5.3). A DTD defines what elements and attributes are allowed, where, and in what order. A DTD may also enumerate the values allowed for each attribute (but not for elements), and it may identify some attributes as type ID (meaning they must have a value that is unique across the XML document) or IDREF (meaning they must match some attribute of type ID). Example 1-5 shows a possible DTD for the movie document.2

Example 1-5   A DTD for movie

image

The first line of Example 1-5 says that a movie must contain a title, a yearReleased, a director, at least one producer, a runningTime, and at least one cast (member), in that order. The following lines describe the “shape” of each of these elements. Each simple (leaf) element, though, is described as “#PCDATA” – despite its name (Document Type Definition), the DTD does not give us any data type information.3 For example, it does not distinguish between runningTime (which is probably an integer) and title (which is probably a string).

XML Schema

DTDs have a couple of drawbacks – they don’t include any data type information about fields,4 and DTDs are not XML documents. XML Schema solves both these problems. Like a DTD, an XML Schema defines where elements may occur in a document, and in what order, in a formal, standard way. But an XML Schema may also describe the data type of the element (integer, string, etc.) and give rules about which values are allowed. And an XML Schema document is itself an XML document, with its own XML Schema.5 Example 1-6 shows a possible XML Schema for the movies document.

Example 1-6   An XML Schema for movie

image

image

In Example 1-6, each element in the XML document is described by an element in the XML Schema called xs:element. A simple element such as title is modeled with the attributes xs:name=“title” xs:type=”xs:string”. An element that has children (subfields), such as director, is described by an xs:complexType element – in this case, a sequence of elements. The elements familyName and givenName occur in several places in the XML document (the instance document), so they are defined once at the start of the XML Schema and are pointed to (via the ref attribute) whenever needed.

XML Schema has a rich set of data types that can be attributed to elements – we have described runningTime as type xs:integer, so it is now distinguishable from the xs:string elements.

Example 1-6 also illustrates two more capabilities of XML Schema:

1. Bounding – yearReleased is an integer and is bounded to be between 1900 and 2100 inclusive.

2. Enumeration – maleOrFemale is a string, and it may only have the value male or female.

DTD, XML Schema, and Others

We have only briefly touched on DTDs and XML Schema, to give a general flavor of each approach. There are other approaches to mapping or modeling data, notably RELAX NG. At the time of writing, DTDs may still be the most common way to describe data. But there seems to be a general move toward XML Schema, with most existing users planning, or at least considering, a move from DTDs to XML Schema and most new users adopting XML Schema. For the rest of this book we will talk in terms of XML Schema, with occasional references to DTDs.

1.2.6 Markup and Meaning

With the XML document in Example 1-4 and the XML Schema in Example 1-6, we know an awful lot about the data.

1. We know how to break it down into meaningful pieces (elements, sub-elements, sub-subelements).

2. Each element and subelement has a meaningful6 name, to improve readability and to refer to the element easily.

3. We have a Schema that describes rules for the data (such as order, type, scooping, and legal values).

Have we succeeded in representing the meaning of the data? Somewhat. We know a lot more about the data, but we still don’t know its meaning.

There are several things we could do to add to what we already know about the data, getting us closer to representing the meaning. First, we could add more tags – for example, we could split the “yearReleased” element into subelements “USA,” “Europe,” and “Asia.” See Chapter 4 for a discussion of semantic markup and metadata. Second, we could use RDF (again, see Chapter 4) to denote which “John Landis” was the producer of this movie and to define relationships to enable inference about the data. But the next logical step is to define an XML-based markup language – i.e., to create a formal definition of the meaning of the data within each of the allowable elements, separate from the actual markup (syntax) definition.

We said at the beginning of this chapter that XML is an Extensible Markup Language – more accurately, it is a language or framework for defining markup languages. This topic is important enough for its own section in this chapter – see Section 1.3.

1.2.7 Why XML?

Before we discuss XML-based markup languages, we want to make sure that you agree that XML is a good thing.

With the movie data marked up as XML and with an XML Schema to describe the data, our movie data is fairly human-readable. Perhaps more important, it is machine-readable. A collection of XML documents plus an XML Schema provide all the information necessary for a program to process the data in a standard way. Any XML parser can parse the document and report errors, any XSLT engine can transform the document [e.g., for display or printing), and any XML-aware query engine can query it.

You could devise your own markup – as we started to do with the comma-separated list. But we recommend using XML instead, for at least the following reasons.

1. Designing a markup strategy is not as simple as it might seem. With our very simple earlier example, we already had to deal with choosing symbols for start and end markers, and escaping marker symbols that appear in actual data. Many man-years of effort have gone into defining XML and its family – why reinvent the wheel?

2. If you use your own markup system for tagging data, you will also need to reinvent a way to describe that data (XML Schema). And you will need to create a suite of tools to process your documents – tools that understand your homegrown tagging system.

3. If you use XML, you can leverage a family of technologies to store, manage, publish, and query data. There is a good chance that your customers, suppliers, and software applications use XML too, so you can share and exchange data with a minimum of effort.

1.3 XML-Based Markup Languages

You read in Section 1.2 that XML defines the representation of a piece of data – e.g., a start tag/end tag pair delimits an element. An XML-based markup language defines the meaning of that representation.

An XML Schema defines a set of elements and attributes and how they can be put together. An XML-based markup language consists of an XML Schema plus a human-readable description of the meaning of the elements and attributes. In terms of a human language, you can think of XML as the alphabet and vocabulary; an XML Schema7 adds grammar (syntax); an XML-based markup language builds on the alphabet, vocabulary, and syntax, adding the semantics of the language. Let’s look at some examples to make these distinctions clear.

MDL – Movie Definition Language

We could create an XML-based markup language for our movie data by taking the XML Schema in Example 1-6 and adding a definition of the semantics of each element. For example:

image

These semantic definitions are human-readable, not machine-readable. The Semantic Web8 is an attempt to make semantics machine-readable (see Chapters 4 and 18). The Semantic Web is still very much a work in progress – in the meantime, let’s look at some XML-based markup languages that are already widely used.

XBRL – Extensible Business Reporting Language

XBRL is an XML-based markup language developed by a consortium (XBRL International) to make it easy for businesses to exchange financial reporting data. The definition of XBRL includes a set of XML Schemas, plus additional syntax rules, to define what is legal in an XBRL instance. XBRL also includes a precise definition of the semantics of each element and attribute. For example, the attribute “precision” is defined in the XBRL 2.1 spec9 like this:

The precision attribute MUST be a non-negative integer or the string “INF” that conveys the arithmetic precision of a measurement, and, therefore, the utility of that measurement to further calculations. Different software packages may claim different levels of accuracy for the numbers they produce. The precision attribute allows any producer to state the precision of the output in the same way. If a numeric fact has a precision attribute that has the value “n,” then it is correct to “n” significant figures (see Section 4.6.1 for the normative definition of ‘correct to “n” significant figures’). An application SHOULD ignore any digits after the first “n” decimal digits, counting from the left, starting at the first nonzero digit in the lexical representation of any number for which the value of precision is specified or inferred to be “n.”

The meaning of precision=“INF” is that the lexical representation of the number is the exact value of the fact being represented.

The first part of this definition – “The precision attribute MUST be a non-negative integer or the string “INF” – can be expressed in XML Schema, and the spec does include an XML Schema for “precision.” The rest is semantic and must be defined as part of the markup language.

Dublin Core

Dublin Core10 defines a markup language for catalog metadata. Dublin Core was initially designed to address the cataloging of books, but it has been extended to cover any kind of information resource, digital or physical. Dublin Core is used extensively in libraries to maintain a rich set of metadata about books, pictures, manuscripts, etc. to make it easier for people to find and browse resources.

Dublin Core defines a set of 15 elements that express the core catalog metadata of a resource. For each element there is a single-word, normative name (e.g., Title, Creator, Subject); a descriptive label, meant to convey the meaning of the element to human readers; a definition, giving a more precise semantic definition; and a comment. Table 1-1 shows three of the elements defined by Dublin Core, taken from DMCI Metadata Terms.11

Table 1-1

Dublin Core Metadata Definition, Sample

Element Name: Title    
  Label: Title
  Definition: A name given to the resource.
  Comment: Typically, Title will be a name by which the resource is formally known.
Element Name: Creator    
  Label: Creator
  Definition: An entity primarily responsible for making the content of the resource.
  Comment: Examples of Creator include a person, an organization, and a service. Typically, the name of a Creator should be used to indicate the entity.
Element Name: Subject    
  Label: Subject and Keywords
  Definition: A topic of the content of the resource.
  Comment: Typically, Subject will be expressed as keywords, key phrases, or classification codes that describe a topic of the resource. Recommended best practice is to select a value from a controlled vocabulary or formal classification scheme.

Dublin Core also includes a set of XML Schemas12 that define how to express these elements in an XML document. Note that Dublin Core does not define a language for a whole document – Dublin Core elements are meant to be inserted in XML, RDF/XML, or HTML documents.

DocBook

DocBook13 defines a set of XML tags for use in creating marked-up books, articles, and documentation. Why would anyone write a document using markup (such as XML) instead of a word processor (such as Microsoft Word)? First, it’s easier to index (for searching) and categorize (for browsing) a document that has some semantic markup. The semantic markup might be additional metadata – i.e., data that is not part of the printed document, such as the intended audience for each chapter. Or it might mark semantic boundaries – e.g., if you mark up the examples in a document, it’s easy to search for examples that feature some term. Careful use of styles in an editor such as Microsoft Word could help with searching and browsing, but people rarely use formatting styles so precisely. Second, when you create a document in XML you separate the content of the document from its physical representation. You can write the content once and then materialize it in a number of formats – paper printed copies, HTML web pages, Braille, audio, and so on. And you can easily present the information in a number of different styles to suit different audiences, such as large type for the sight impaired and highly colorful for teenagers. This is not possible with WYSIWYG editors such as Word, where the author applies specific formatting instructions when creating the content.

DocBook defines its set of tags normatively using a DTD. The DocBook project started in 1999, too early for XML Schema to be used, though there is an “experimental W3C XML Schema” as well as experimental RELAX NG, RELAX, and TREX schemas.14

The online DocBook15 includes a simple sample XML file that conforms to the DocBook DTD; see Example 1-7.

Example 1-7   DocBook Example Document

image

Because this sample is valid according to the DocBook DTD (or Schema), you know that any stylesheets16 designed to work on DocBook will work on this sample. But if you are writing an article or you are writing a stylesheet to format an article, you need to know what these tags mean. For example, what is an indexterm? How should you use this tag in a document? What should a stylesheet do with it? For the meaning of the tags, again you need to look at the human language documentation, not just the DTD (or Schema). DocBook describes the semantics of the indexterm tag and the “Processing expectations” – i.e., what you can expect a stylesheet to do with an indexterm element.

XML-Based Markup Languages – Summary

In Section 1.2 we hope we convinced you that XML is a useful way to represent data. In this section, we convinced you that XML (with XML Schema) must be complemented by some semantic information to make all those tags actually mean something. Everyone who writes an XML document to be shared, exchanged, and/or manipulated by others must also define the structure of the document and the semantics of each of its parts (at least, it’s hard to imagine an XML document simple enough that it would not require any such external definition). That is, every XML author or consumer requires an XML-based markup language definition. Many make up their own for a limited domain of use. The languages we have described in this section are just the better-known standard ones.

1.4 XML Data

There is an old Indian story about six blind men who stumble across an elephant. Each man reaches out and touches a different part of the elephant. One grabs the tail and says he has found a length of string; one touches a leg and declares he has found a tree trunk; a third feels the side of the elephant’s body and is convinced he is standing in front of a wall; and so on. Similarly, there are many kinds of XML data, each with its own characteristics and uses. When a developer or software user talks about “XML data,” often she means just one of these kinds of data. Like the blind men in the story, she may be convinced that hers is the only (important) kind of XML data that exists.

In this book, when we talk about “Querying XML,” we will be clear about what applies to all XML (the whole elephant) and how the different kinds of XML (the tail, the trunk, the body) are treated. When we talk about processing (and in particular querying) XML data in this book, we consider three broad categories of XML data: structured, unstructured, and messages.17

1.4.1 Structured Data

The movie sample we used in Example 1-4 is an example of structured data. All the data in movie is in small, well-defined chunks (givenName, familyName, …), and there are some obvious tree-structure (or parent-child) relationships (e.g., producer naturally breaks down into givenName, familyName, and otherNames). Other examples of structured data include purchase orders, library catalogs, parts inventories, and payroll records. Structured data is often managed in a persistent store, such as a database.

1.4.2 Unstructured Data

For our purposes, unstructured data is data with significant amounts of text. Examples include a Microsoft Word document, an e-mail, and a technical manual. The term unstructured is misleading – all documents have some structure, even if it’s just the structure that’s implicit in, e.g., punctuation marks.18 Using XML to represent an unstructured document allows you to add structure and/or formalize the existing structure. You can also employ XML to mark up unstructured data for presentation, but this should be avoided – in general, you should use XML for semantic markup and leave it to a reporting/publishing tool (such as XSLT) to map “meaning” into presentation.

1.4.3 Messages

An XML message is typically a small, well-defined piece of data passed from one application to another, possibly as (or in) a stream. This is an increasingly popular use of XML, enabling application integration and web services. Messages are usually highly structured, but they are different from most structured data because they generally need to be queried one at a time, possibly in a stream, and there is generally a requirement to process many (possibly many thousands) per second. Messages are generally created, consumed, and disposed of on the fly, with no permanent storage and no need for updates.19

1.4.4 XML Data – Summary

Table 1-2 summarizes the characteristics of the three kinds of data we have discussed so far.

Table 1-2

XML Data

image

1.5 Some Other Ways to Represent Data

XML is not the only way to represent structured or unstructured data. In this section, we discuss some other popular ways to represent data and compare them with XML.

First, we discuss SQL, which is currently the most prevalent way of representing structured data. Then we look at some of the presentation markup languages, which describe unstructured and semistructured data with an emphasis on presentation. Last, we look at a couple of XML’s closest relatives, SGML and HTML.

1.5.1 SQL – Structure Only

SQL – the SQL Query Language – has been the main way of storing and querying structured data for several decades. More recently, the SQL world has embraced the object-relational data representation and has expanded its scope to include unstructured data such as text, AVI (audio, video, image), and spatial data.

In a relational database, records become rows in a table, and fields become the cells of the table. Our movie example might be represented as in Figure 1-1.

image

Figure 1-1 movie, SQL (Relational) Representation.

Figure 1-1 shows six relational tables. Relational tables are built as columns and rows. Only the rows pertaining to the movie “An American Werewolf in London” are shown.

The first table, MOVIES, has a column for each simple, nonrepeating field in the movie record. It also has an ID field. It is common to give a relational table an extra column that is a unique identifier, or “primary key.” With the ID column, we can easily refer to any row in the table (any movie). The table MOVIES achieves field separation and naming, just as XML does. SQL databases have a data dictionary that maps the data, describing the columns that make up each table and the type of each column. But the traditional relational table is flat – it cannot directly represent the tree structure we saw in the XML examples. How can we represent a field such as director – which is made up of several fields – relationally? We create a new table, DIRECTORS, with an ID column, and we reference the “John Landis” ID in the MOVIES table. This strategy will not work for producer, since there can be more than one producer for a given movie. Creating a producers table helps, but we now have two producer IDs for “An American Werewolf in London.”20

To handle repeating fields we need a join table such as MOVIES-PRODUCERS, which maps records in MOVIES to records in PRODUCERS via their primary key fields. We have handled cast in the same way – although there is only one cast member in our sample fragment, we must be able to represent many cast members per movie. Since cast includes a field called character, it is easy to imagine character as another set of subfields (givenName, familyName), which would require yet more tables.

SQL tables do not represent hierarchical structures as readily as XML. Figure 1-1 represents all the data and relationships that Example 1-4 does, but we needed to create six tables and do some design work to achieve that. The SQL world has addressed the limitations of two-dimensional tables in several ways:

1. Subtables: Many modern relational databases allow “the thing in a cell of a table” to be another table (subtable) or an array.

2. Object-relational: An object-relational database allows “the thing in a cell of a table” to be an object, not just a field. A field is a single value, whereas an object can have a complex type made of several values.

Even with the power of subtables and objects, it is more natural to represent hierarchical structure as XML. And XML can be more flexible – it is not essential to have a DTD or an XML Schema, so you can create complex fields, repeating fields, and arbitrarily deep hierarchical structure on the fly (though some would say this is not a good thing). SQL, on the other hand, can represent more complex relationships quite easily, while XML has to massage everything into a tree structure (not all data is naturally tree-shaped). SQL can handle constraints, such as “every movie must have at least one producer,” with which XML is still struggling. And SQL has the notions of transactions and updates, which the XML world (at the time of writing) has only just begun to consider. Add to that the availability and maturity of robust, scalable SQL databases, indexing technology, expertise, and tools, and you can see why SQL is still the preferred way to structure, store, and query data in many applications.

1.5.2 Presentation Languages – Presentation Only

There is a family of markup languages that deal only with presentation and have nothing to say about structure or semantics. These languages allow the creator of the content (generally documents) to dictate exactly how the text should appear, first on a printed page and later on a computer screen.

roff, troff, groff

troff was written in 1973 by Joe Ossana. troff is a typesetting program that takes as input a text file containing a mix of content and markup (in troff format) and outputs a file that can produce a formatted, paginated printed document on a typesetter. Originally the output of troff would drive only a Graphic Systems CAT typesetter; it was modified in 1979 by Brian Kernighan to work with any typesetter.

Example 1-8 gives the flavor of a troff file – it is somewhat verbose, not terribly human-readable, but gives you complete control over the presentation of text, troff was modeled on the earlier roff (run-off). GNU has produced a C++ version of roff called groff.

Example 1-8   troff Markup to Start a New Paragraph

image

TeX/LaTeX

TeX21 is a macro-based text formatting language produced by Donald Knuth. Disappointed with the quality of the typesetting in his Art of Computer Programming,22 Knuth started writing TeX in 1978. It quickly became popular enough to displace troff in the technical typesetting community. TeX gives the author complete control over typesetting presentation and is especially useful for producing documents with specialized formatting requirements, such as scientific and mathematical journals and textbooks.

In 1984, Leslie Lamport wrote LaTeX,23 a document-preparation system layered on top of TeX. LaTeX makes TeX more accessible to authors and has been adopted as a standard in many technical publishing houses.

PostScript

PostScript is a page-description language – a programming language for printing graphics and text – developed by Adobe in 1985. Today, PostScript is the de facto standard for communicating with printers. Example 1-924 is a PostScript program to print “Hello, world!” in Times-Roman, 20 points, in the lower left corner of the page.

Example 1-9   PostScript Markup to Print Hello, world!

image

PDF

PDF – Portable Document Format – is the de facto standard for exchanging electronic documents. The PDF format is owned and developed by Adobe. It is purely a presentation format – like PostScript25 and troff, PDF files represent text and graphics precisely, but it makes no attempt to be human-readable or machine-processable. PDF has been called a “paper format” – even though a PDF document is a file, it has many of the characteristics of a printed page. Sometimes this is desirable – e.g., PDF files can be protected by digital signature to preserve the integrity of their contents. Clearly this is an advantage if you are dealing with legal contracts or other critical information. On the other hand, PDF documents are notoriously difficult to manipulate and process – even editing a PDF document is hard. In general, PDF is used as an end format – that is, data is stored, processed, and managed in some other format (such as XML) and converted to PDF for printing and/or publication.

1.5.3 SGML

SGML is the Standard Generalized Markup Language, ISO standard 8879.26 SGML and XML are closely related – in fact, (almost) every XML document is a valid SGML document, since XML was originally born as an SGML profile, or subset.27

SGML introduced a number of features that continue to be important in XML:28

• Descriptive markup – the idea that markup should not be procedural (as in the presentation languages in Section 1.5.2). Rather, markup should be descriptive. This distinction is important in the development of SGML (and later XML) as a language that is independent of any platform or application.

• Document type – SGML introduced the notion of a document type and was the first language to define a DTD (Document Type Definition). The document type is important for defining the structural constraints on a document. This notion carried over into XML Schema as a complex type.

• Data independence – one of the primary goals of SGML was to enable faithful sharing of documents across different hardware and software platforms. One concrete way this was achieved was to introduce the notion of entities, to provide “descriptive mappings for nonportable characters.”

SGML enjoyed some success in the 1990s, mostly in places with high-end document processing and publishing requirements, such as the aircraft industry (for aircraft maintenance manuals) and the military. But most people agree that SGML is too complicated for more general use.

1.5.4 HTML

HTML is the one standard that needs no introduction. We are confident that everyone that reads this has read HTML, and almost all have written at least some HTML.

HTML contributed to the Internet boom of the late 1990s by providing a simple, standard markup language that was, like SGML, independent of hardware and software platforms and that separated content from presentation. We emphasize “standard” because in the early days of the Internet it was important to have a standard way to exchange data that could be presented in a rich format by any browser. Unfortunately, the standard defined by the W3C was contaminated by the proprietary extensions of all the major browser vendors during the so-called “browser wars,” leading to the heinous “this page best displayed in …” labels on many websites.

At first glance, HTML looks very similar to XML – it consists of start and end tags that delimit elements, optionally with attributes. But HTML differs from XML in two important ways, one technical and the other conceptual.

First, HTML is much more forgiving (some would say “sloppy”) than XML. For example, a paragraph tag in HTML starts at a paragraph start tag (<p>) and ends at either a paragraph end tag (</p>) or immediately before the next paragraph start tag, whichever comes first. In XML, this construct is not allowed – a start tag with no matching end tag is not valid. Similarly, an empty element in HTML can be represented by a stand-alone start tag (such as the line separator, <br>). Again, this kind of stand-alone marker is not valid in XML – an empty element must be a start tag/end tag pair (<br></br>) or the shorthand empty element representation (<br/>).

Second, XML is all about marking up the meaning of the data, whereas HTML has drifted toward presentation markup. There has been much (sometimes heated) debate over the exact line between semantic and presentation markup – is a “heading” semantics or presentation? But HTML, with its tags for purely formatting markup, such as italics and boldface, has definitely crossed over that line into presentation markup.

Fortunately, both of these differences can be resolved. Most HTML can be turned into valid XML (and not lose its validity as HTML) with some simple cleanup, such as making sure all start tags have a matching end tag. And XML is a markup language (or a language for markup languages) – you can use XML to represent any kind of markup, semantic or representation (or syntactic or anything else). XML is particularly effective when used to markup the meaning of data, leaving the presentation aspects to some other step, such as applying an XSL stylesheet. But there is no reason why you should not use XML for the mix of semantic and representation markup that is HTML. XHTML29 does just that – XHTML defines a variant of HTML that is also valid XML. The XHTML 1.0 spec actually defines several flavors of XHTML. XHTML transitional is very close to HTML, but it is also valid XML. XHTML strict goes further, eliminating representation markup for fonts, colors, and other formatting (in favor of CSS, Cascading Style Sheets).

Today, all the leading browsers will display not only HTML and XHTML, but also XML with an associated XSL stylesheet. We believe that over the next few years, all new documents on the web will be either XHTML or XML.

1.6 Chapter Summary

In this chapter, we introduced XML, the Extensible Markup Language. XML is common enough that we expect everyone reading this book to have some familiarity with it, so we used this chapter to put XML in some historical and technical context. We discussed what markup is and what it’s good for. Then we looked at a number of different kinds of XML data, to show where and how XML is useful. And we looked briefly at some of the other ways to represent data and compared them with XML. In the next chapter, we discuss querying. Once we have laid the foundations with discussions of XML and querying, we can introduce the title topic of this book – querying XML.


1See also Chapter 5, “Structural Metadata.”

2Example 1-5 is one possible DTD that describes the movie document. When you create a DTD based on a sample document, you can’t tell which of the elements in the sample are optional or which elements may occur more than once. Some elements may be optionally present in a document but not present in your sample document. If your document includes attributes, you can’t tell which are IDs or IDREFs, and you can only guess at attributes’ enumerated values.

3Though the DTD does not give us data type information, it does give us the type of the document, in the sense of Schema’s Complex Types.

4A DTD may include some data type information for attributes, such as ID/IDREF type and enumeration.

5W3C Schema for Schemas, available at: http://www.w3.org/TR/xmlschema-l/#normative-schemaSchema.

6A well-thought-out element name does add some meaning for a human reader with some knowledge of the data and/or the domain. But without a defined vocabulary, it’s only a very little meaning, and of course that meaning can’t be machine-processed.

7Or a DTD.

8See the W3C Semantic Web Activity at: http://www.w3.org/2001/sw/.

9XBRL specifications and recommendations are available at: http://www.xbrl.org/SpecRecommendations/.

10The Dublin Core Metadata Initiative, http://dublincore.org/.

11http://dublincore.org/documents/dcmi-terms/

12In fact, Dublin Core defines an RDF Schema as well as XML Schemas to describe its elements. See http://dublincore.org/schemas/.

13Norman Walsh and Leonard Muellner, DocBook: The Definitive Guide (Sebastopol, CA: O’Reilly, 1999). See http://www.docbook.org/.

14See http://www.docbook.org/ for details and links.

15http://www.docbook.org/tdg/index.html

16A stylesheet is a mechanism for programmatically transforming an XML document into another format, such as HTML. See Chapter 7.

17The alert reader might question this breakdown, for it appears to mix two dimensions: structured vs. unstructured and persistent vs. transient data. However, we think this does represent the three main uses of XML – see the following few sections.

18Some people use a third category, “semistructured documents,” to refer to documents that mix structured and unstructured elements.

19This is not always the case. Some messages may be stored for long periods before being consumed, but they are generally not queried many times or updated.

20We could have movie IDs in the PRODUCERS table, but we assume a producer produces several movies.

21Donald E. Knuth, The TeXBook (New York: Addison-Wesley Professional, 1984).

22Donald E. Knuth, The Art of Computer Programming, Volumes 1–3 (New York: Addison-Wesley Professional, 1998).

23Leslie Lamport, LaTeX: A Document Preparation System (New York: Addison-Wesley Professional, 1994).

24You can find this example in many places on the web, e.g., http://docs.mandragor.org/files/Programming_languages/Forth_And_PostScript/First_Guide_To_PostScript_en/text.htm.

25Some people even refer to PDF as “smart PostScript.”

26ISO 8879:1986, Information processing – Text and office systems – Standard Generalized Markup Language (SGML) (Geneva, Switzerland: International Organization for Standardization, 1986). Available at: http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=16387.

27For a listing of the SGML declaration for XML and a description of the differences between SGML and XML, see: James Clark, Comparison of SGML and XML (Cambridge, MA: World Wide Web Consortium, 1997). Available at: http://www.w3.org/TR/NOTE-sgml-xml.html.

28C. M. Sperberg-McQueen and Lou Burnard (eds.), A Gentle Introduction to SGML (The Text Encoding Initiative, 1994). Available at: http://www.isgmlug.org/sgmlhelp/g-index.htm.

29XHTML 1.0 The Extensible HyperText Markup Language (Second Edition): A Reformulation of HTML 4 in XML 1.0 (Cambridge, MA: World Wide Web Consortium, 2002). Available at: http://www.w3.org/TR/xhtmll/.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.216.255.250