Chapter 20. XML Jargon Demystifier™ Introductory Discussion

  • Structured vs. unstructured

  • Tag vs. element

  • Document type, DTD, and markup declarations

  • Schema and schema definitions

  • Documents and data

 

“When I use a word,” Humpty Dumpty said, in a rather scornful tone, “it means just what I choose it to mean, neither more nor less.”

 
 --Lewis Carroll, Through the Looking Glass

One of the problems in learning a new technology like XML is getting used to the jargon. A good book will hold you by the hand, introduce terms gradually, and use them precisely and consistently.

Out in the real world, though, people use imprecise terminology that often makes it hard to understand things, let alone compare products. And, unlike authors,[1] they sometimes just plain get things wrong.

For example, you may see statements like “XML documents are either well-formed or valid.” As you’ve learned from this book, that simply isn’t true. All XML documents are well-formed; some of them are also valid.

In this book, we’ve taken pains to use consistent and accurate terminology. However, for product literature and other documents you read – even Office help! – the mileage may vary. So we’ve prepared a handy guide to the important XML jargon, both right and wrong.

Structured vs. unstructured

Structured is arguably the most commonly used word to characterize the essence of markup languages. It is also the most ambiguous and most often misused word.

There are four common meanings:

structured = abstract

XML documents are frequently referred to as structured while other text, such as renditions in notations like RTF, is called unstructured. Separating “structure from style” is considered the hallmark of a markup language. But in fact, renditions can have a rich structure, composed of elements like pages, columns, and blocks. The real distinction being made is between “abstract” and “rendered”.

structured = managed

This is one of the meanings that folks with a database background usually have in mind. Structured information is managed as a common resource and is accessible to the entire enterprise. Unfortunately, there are also departmental and individual databases and their content isn’t “structured” in quite the same sense.

structured = predictable

This is another alternative for relational database people. Structured data is captured from business transactions, comes in easily identified granules, and has metadata that identifies its semantics. In contrast, freeform data is normally buried in reports, with no metadata, and therefore must be “parsed” (by reading it!) to determine what it is and what it means. If an essentially freeform document has islands of structured data within it, the document might be termed semi-structured. See 20.8, “Documents and data”, for more on this.

structured = possessing structure

This is the dictionary meaning, and the one used in this book. There is usually the (sometimes unwarranted) implication that the structure is fine-grained (rich, detailed), making components accessible at efficient levels of granularity. A structure can be very simple – a single really big component – but nothing is unstructured. All structure is well-defined and “predictable” (in the sense of consistent); it just may not be very granular.

These distinctions aren’t academic. It is very important to know which “structured” a vendor means.

What if your publishing system has bottlenecks because you are maintaining four rendered versions of your documents in different representations? It isn’t much of a solution to “structure” them in a database so that modifying one version warns you to modify the others.

You’ll want to have a single “structured” – that is, abstract – version from which the others can be rendered. And if you find that your document has scores of pages unrelieved by sub-headings, you may want to “structure” it more finely so that both human readers and software can deal with it in smaller chunks.

Keep these different meanings in mind when you read about “structured” and “unstructured”. In this book, we try to confine our use of the word to its dictionary meaning, occasionally (when it is clear from the context) with the implication of “fine-grained”.

Tag vs. element

Tags aren’t the same thing as elements. Tags describe elements and delimit them.

In Figure 20-1 the pet carrier, metaphorically speaking, is an element. The contents of the carrier are the content of the element. It is bounded by two tags.

Figure 20-1. Tags aren’t elements!

The start-tag, at the left, describes the element. It contains three names:

  • The element-type name (dog), which says what type of element it is.

  • A unique identifier, or id (Spike), which says which particular element it is.

  • The name of an attribute that describes some other property of the element: weight="8 lbs".

The end-tag, at the right, marks the end of the element. It repeats the element-type name.

When people talk about a tag name:

  1. They are referring to the element-type name (in this case, dog).

  2. They are making an error, because tags aren’t named.

And when they talk about an element name:

  1. They are again referring to the element-type name.

  2. They are again making an error, because an element is named by its unique identifier (in this case, Spike).
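The distinction can be seen by parsing a single element. What follows is a minimal sketch, not the book's own example: the markup is our guess at what the start-tag in Figure 20-1 would look like (the attribute name id and the element content are assumptions), and it uses Python's standard xml.etree.ElementTree module. Note that ElementTree itself calls the element-type name a "tag", which is exactly the loose usage described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the pet-carrier element of Figure 20-1.
# The attribute name "id" and the content text are assumptions.
markup = '<dog id="Spike" weight="8 lbs">a small terrier</dog>'

element = ET.fromstring(markup)

# ElementTree exposes the element-type name through a property named "tag".
element_type_name = element.tag        # the type of element: dog
unique_identifier = element.get("id")  # which particular element: Spike
weight = element.get("weight")         # another property of the element
content = element.text                 # what lies between the two tags

print(element_type_name, unique_identifier, weight, content)
```

The start-tag and end-tag themselves never show up as objects here. The parser consumes them and hands back the element they delimited.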

Content

We know that formally the content of an element is what occurs between the start-tag and the end-tag. Therefore, the content of a document is what occurs between the first start-tag and the last end-tag of the document.

So when people say that “XML separates content from presentation”, they really mean that XML lets you separate abstract data (in the document) from rendition information (in a stylesheet).

When they say “an XML document has content and structure”, they mean it has data and structure.

People also refer to “content” or “XML content” as a commodity: “Our website has dynamic, involving, interactive, rich, multimedia XML content”. We do that as well in this book when the context is clear (but without the adjectives!).

Some people – not us – also use the term “content” when making a principled distinction between data intended for people (“content”) and data intended for machines (“data”).

Document type, DTD, and markup declarations

A document type is a class of similar documents, like telephone books, technical manuals, or (when they are marked up as XML) inventory records.

A document type definition (DTD) is the set of rules for using XML to represent documents of a particular type. These rules might exist only in your mind as you create a document, or they may be written out.

Markup declarations, such as those in Example 20-1, are XML’s way of writing out DTDs.

Example 20-1. Markup declarations in the file greeting.dtd.

<!ELEMENT greeting (salutation, addressee) >
<!ELEMENT salutation (#PCDATA) >
<!ELEMENT addressee  (#PCDATA) >

It is easy to mix up these three constructs: a document type, XML’s markup rules for documents of that type (the DTD), and the expression of those rules (the markup declarations). It is necessary to keep the constructs separate if you are dealing with two or more of them at the same time, as when discussing alternative ways to express a DTD. But most of the time, even in this book, “DTD” will suffice for referring to any of the three.

Schema and schema definition

The programming and database worlds have introduced some new terminology to XML.

We now speak of a document type as a kind of schema, a conception of the common characteristics of some class of things. Similarly, a DTD is a schema definition, the rules for using XML to represent documents conforming to the schema.

Schema definitions are invariably written out in a notation called a schema definition language, or simply a schema language. And as with DTDs, the word “schema” can serve for all these purposes when there is no ambiguity.

Document, XML document, and instance

The term document has two distinct meanings in XML.

Consider a really short XML document that might be rendered as:

Hello World

The conceptual document that you see in your mind’s eye when you read the rendition is intuitively what you think of as the document. Communicating that conception is the reason for using XML in the first place.

In a formal, syntactic sense, though, the complete text (markup + data, remember) of Example 20-2 is the XML document. Perhaps surprisingly, that includes the markup declarations for its DTD (shown in Example 20-1). The XML document, in other words, is a character string that represents the conceptual document.[2]

Example 20-2. A greeting document.

<?xml version="1.0"?>
<!DOCTYPE greeting SYSTEM "file://greeting.dtd">
<greeting>
<salutation>Hello</salutation>
<addressee>World</addressee>
</greeting>

In this example, much of that string consists of the markup declarations, which express the greeting DTD. Only the last four lines describe the conceptual document, which is an instance of a greeting. Those lines are called the document instance.
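The round trip described in footnote 2 (parse the string, work on an object model, serialize the result) might look like the sketch below. It assumes Python's standard xml.etree.ElementTree module, which checks only well-formedness and does not fetch or apply greeting.dtd. The replacement value "Mars" is an arbitrary illustration.

```python
import xml.etree.ElementTree as ET

# The XML document of Example 20-2, held as a character string.
xml_document = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE greeting SYSTEM "file://greeting.dtd">\n'
    '<greeting>\n'
    '<salutation>Hello</salutation>\n'
    '<addressee>World</addressee>\n'
    '</greeting>\n'
)

# Parsing builds an in-memory object model of the document instance.
# The external DTD is not read; only well-formedness is checked.
root = ET.fromstring(xml_document)

# Navigate in terms of the conceptual structure rather than the string.
rendition = f"{root.findtext('salutation')} {root.findtext('addressee')}"

# Update the object model, then serialize it as the result document.
root.find("addressee").text = "Mars"  # arbitrary illustrative value
result_document = ET.tostring(root, encoding="unicode")

print(rendition)
print(result_document)
```

With this particular library the serialized result contains only the document instance. The XML declaration and the document type declaration are not written back out, so a program that needs them has to add them itself.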

That term gets flipped around when schema languages are involved. Unlike DTD declarations, schema languages are XML-based, so a schema definition must be stored as an XML document in its own right (a schema document). That means an instance of a schema definition is a separate document, so it is known as an instance document.

What’s the meta?

Nothing. What did you think was the meta?[3]

There are two “meta” words that come up regularly when computer types talk about XML: metadata and metalanguage.

Metadata

Metadata is data about data. The date, publisher’s name, and author’s name of a book are metadata about the book, while the data of the book is its content. The DTD and markup tags of an XML document are also metadata. If you choose to represent the author’s name as an element, then it is both data and metadata.

If you get the idea that the line between data and metadata is a fluid one, you are right. And as long as your document representation and system let you access and process metadata as though it were data, it doesn’t much matter where you draw that line.

Be careful when talking to database experts, though. In their discipline “metadata” typically refers only to the schema.

Metalanguage

You may hear some DTDs or schemas referred to as languages, rather than document types. HTML is a prominent example. There’s nothing special about them; it is just another way of looking at the way a markup language works.

Remember that an XML document is a character string that represents some conceptual document. The rules for creating a valid string are like the rules of a language: There is a vocabulary of element type and attribute names, and a grammar that determines where the names can be used.

These language rules come from the DTD or schema, which in turn follows the rules of XML. A language, such as XML, which you can use to define other languages (such as DTDs), is called a metalanguage. XML document types are sometimes called XML-based languages.

Documents and data

For many decades, data processing got the big budgets while document processing got a room in the basement with a copying machine. While the data processors relished their importance to the organization, the document processors basked in their importance to humanity. They were preservers of human knowledge, not just high-speed bean counters.

No wonder the two never got along!

Markup languages are changing all that. With XML, documents and databases both store data and can share it, so document processing and data processing can be performed at the same time, by the same people.

It’s all data!

In an XML document, the text that isn’t markup is data. You can edit it directly with an XML editor or plain text editor. With a stylesheet and a rendering system you can cause it to be displayed in various ways.

In a database, you can’t touch the data directly. You can enter and revise it only through forms controlled by the database program. However, rendition is similar to XML documents, except that the stylesheet is usually called something like “report template”.

The important thing is that, in both cases, the data can be kept in the abstract, untainted by the style information for rendering it. This is very different from word processing documents, of course, which normally keep their data in rendered form. Even WordML is a rendition, despite its use of XML.

Data-centric vs. document-centric

Documents, data, and processes are sometimes characterized as “data-centric” in contrast to “document-centric”. Since all XML documents (except empty ones) contain data, these terms are actually a misleading shorthand. Worse, they are applied in two very different contexts:

  • how much the XML resembles relational data; and,

  • whether you have to deal with the whole document at once.

How relational is it?

The data-centric misnomer is common among database hackers trying to describe structures that map easily onto relational tables and primitive datatypes. Structures that don’t are called document-centric.

The intended meaning of data-centric is that the document structure – really element structure, since a document is essentially just the largest element – is fully predictable.

An element has a fully predictable structure if it and its subelements are constrained to contain either:

  • type-sequenced elements (e.g., a sequence of elements of the types: quantity, itemNum, description, price),

  • data characters only (i.e., #PCDATA), or

  • nothing at all.

Fully predictable elements can easily be visualized as forms. A business transaction document such as a purchase order is more likely to be fully predictable than a memo.
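The three conditions can be turned into a rough test. The sketch below is one strict reading of them, not a standard algorithm: it takes content models written as they would appear in DTD element declarations and treats any choice group, occurrence indicator, or mixed content as not fully predictable. The element types lineItem, note, and emphasis are invented for illustration; the others come from the sequence listed above.

```python
import re

# Hypothetical content models in DTD notation. "lineItem", "note", and
# "emphasis" are invented names; the rest come from the text's example.
CONTENT_MODELS = {
    "lineItem": "(quantity, itemNum, description, price)",
    "quantity": "(#PCDATA)",
    "itemNum": "(#PCDATA)",
    "description": "(#PCDATA)",
    "price": "(#PCDATA)",
    "note": "(#PCDATA | emphasis)*",
    "emphasis": "(#PCDATA)",
}

NAME = r"[A-Za-z_][\w.-]*"
# A plain sequence: names separated by commas, no choices or repetition.
SEQUENCE = re.compile(rf"\(\s*{NAME}(\s*,\s*{NAME})*\s*\)")
DATA_ONLY = re.compile(r"\(\s*#PCDATA\s*\)")


def is_fully_predictable(element_type, models=CONTENT_MODELS, _seen=None):
    """Return True if the element type and all its subelements are
    constrained to a type sequence, data characters only, or nothing."""
    seen = _seen if _seen is not None else set()
    if element_type in seen:
        # Already checked or currently being checked; avoids endless recursion.
        return True
    seen.add(element_type)
    model = models[element_type].strip()
    if model == "EMPTY" or DATA_ONLY.fullmatch(model):
        return True
    if SEQUENCE.fullmatch(model):
        children = re.findall(NAME, model)
        return all(is_fully_predictable(c, models, seen) for c in children)
    return False  # choices, repetition, or mixed content


print(is_fully_predictable("lineItem"))  # form-like purchase order line
print(is_fully_predictable("note"))      # freeform mixed content
```

Under these assumptions lineItem passes and note does not, which matches the intuition that a purchase order line looks like a form while a note with inline emphasis does not.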

In addition to “data-centric”, the misnomer highly structured is sometimes used. However, highly predictable would be more precise, particularly as many documents that aren’t fully predictable are still much more predictable than they are freeform.

How granular is it?

Another (mis)use of data-centric is to characterize the storage and/or access of documents at the level of individual elements, rather than the entire document at once (document-centric). Once again, the usage is misleading because what it describes has nothing to do with data per se, and because it implies a contradiction between data and documents that does not exist.

Document processing vs. data processing

While “data-centric” and “document-centric” aren’t rigorous terms for characterizing information, they are quite meaningful when applied to processing. XML, however, because it can preserve abstract data (like a database) but still be interchanged and processed as a character string (like a document), is starting to break down the historic separation of the two paradigms. Applications can now intermix data processing and document processing techniques to get the job done.

Comparing documents to data

Since documents contain data, what are people doing when they compare or contrast documents and data?

They are being human. Which is to say, they are using a simplified expression for the complex and subtle relationship shown in Table 20-1. They are comparing the typical kind of data that is found in XML and word processing (WP) documents with business process (BP) transactional data (operational data), which usually resides in databases.

Table 20-1. Typical traits of data

                 XML data            BP data       WP data

Presentability   Abstraction         Abstraction   Rendition

Source           Written             Captured      Written

Structure        Hierarchy + links   Tables        Paragraphs

Purpose          Processing          Processing    Presentation

Location         Document            Database      Document

Note that the characteristics in the table are typical, not fixed. For example, XML data can be a rendition (HTML and WordML are examples). In addition, XML data could:

  • Be captured from a data entry form or a program (rather than written);

  • Consist of simple fields like those in a relational table (rather than a deeply nested hierarchy with links among the nodes); and

  • Be intended for presentation as well as processing.

Caution

The true relationship between documents and data isn’t as widely understood as it ought to be, even among experts. That is in part because the two domains existed independently for so long. This fact can complicate communication.

And in conclusion

The matrix in Figure 20-2 ties together a number of the concepts we’ve been discussing.

Figure 20-2. A rendition can be generated from an abstraction

The top row contains two conceptual documents, as they might appear in your mind’s eye. Actually, they are two states of the same document. The left column shows the document in its abstract state, while the right column shows it rendered.

The bottom row shows the computer representations of the abstraction and the rendition. The abstraction uses XML notation while the rendition uses HTML. The horizontal arrow indicates that the rendition was generated from the abstraction.

The diagram illustrates some important points:

  • Abstraction and rendition are two presentability states that a document can be in. Renditions are ready to be presented; abstractions aren’t.

  • Renditions can be generated from abstractions automatically.

  • Markup languages can represent both abstractions and renditions; “structuring in XML” is no guarantee that you’ll get an abstraction.

  • The computer representation of a document incorporates two ideas: presentability and notation. In other words, the representation of a document is either an abstraction or a rendition, and is either in an XML-based language or some other notation.
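As a small illustration of generating a rendition from an abstraction, the sketch below turns the greeting instance of Example 20-2 into HTML. It assumes Python's standard library, and the style decisions (a paragraph with a bold addressee) are hypothetical stand-ins for what a stylesheet would normally supply. The figure's actual HTML may differ.

```python
import xml.etree.ElementTree as ET
from html import escape

# The abstraction: the document instance from Example 20-2.
abstraction = (
    "<greeting>"
    "<salutation>Hello</salutation>"
    "<addressee>World</addressee>"
    "</greeting>"
)


def render_as_html(xml_text):
    """Generate an HTML rendition from a greeting abstraction.

    The chosen HTML elements are assumptions made for illustration.
    """
    greeting = ET.fromstring(xml_text)
    salutation = escape(greeting.findtext("salutation"))
    addressee = escape(greeting.findtext("addressee"))
    return f"<p>{salutation} <b>{addressee}</b></p>"


rendition = render_as_html(abstraction)
print(rendition)
```

Both strings are markup, but only the first is an abstraction. The second has style baked in, and a program reading it can no longer tell which word was the addressee.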



[1] We should be so lucky!

[2] After a program parses the string it usually keeps an object model in memory so that it can navigate and access data directly in terms of the conceptual document structure. During processing it usually updates the object model, then serializes it as the result XML document.

[3] Sorry about that!
