In this section, we will cover advanced topics of XML markup such as entities, content specifications (including mixed content), and the remaining attribute types. We begin with how and why to add character references to your XML document.
XML uses the Unicode character set. Unicode is a standard that assigns a code to characters from all existing, and even ancient, languages. If your keyboard does not have a key for a particular international character, you can insert that character into an XML document using a character reference.
There are two formats for a character reference: decimal and hexadecimal (hex). The decimal representation is
CharRef ::= '&#' [0-9]+ ';'
For example:
<P> Here is a special character: &#169; </P>
In Unicode, as in Latin1, position 169 is the copyright symbol (©).
The hex representation is
CharRef ::= '&#x' [0-9a-fA-F]+ ';'
Here is an example of a character reference in hex:
<P> Here is another special character: &#xB6; </P>
In decimal, this character is position 182, which is a paragraph symbol.
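To see both reference formats resolve to the same kind of character data, here is a minimal sketch using Python's standard xml.etree.ElementTree parser. The choice of Python and the variable names are my own assumptions for illustration; any conforming XML parser should behave the same way.

```python
import xml.etree.ElementTree as ET

# The parser replaces each character reference with the character it names
# before the application ever sees the text.
decimal = ET.fromstring("<P>Here is a special character: &#169;</P>")
hexadecimal = ET.fromstring("<P>Here is another special character: &#xB6;</P>")

copyright_sign = decimal.text[-1]   # last character of the element text
pilcrow = hexadecimal.text[-1]

# Report the Unicode positions rather than the glyphs themselves.
print(ord(copyright_sign))
print(ord(pilcrow))
```

The application receives ordinary characters; it cannot tell whether the author typed the character directly or used a reference.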
Just as character references are abbreviations for Unicode characters, the XML specification also allows you to define your own abbreviations called entities and refer to them with entity references.
Entities and entity references are techniques for enabling reuse of both content and markup in your XML documents. Entities can be confusing because there are many different categories of entities; however, after you clearly understand the various categories and to what category a particular entity belongs, they make sense. Figure 4.1 depicts the hierarchy of entity categories.
Before demonstrating each category, let's look at Table 4.1, which provides definitions of each category.
Note
An internal entity must be a parsed entity.
In its simplest form (an internal, general parsed entity) an entity is just an abbreviation for larger text. For example, an entity dtd could be used to abbreviate the phrase document type definition.
You declare entities in your document type definition like this:
<!ENTITY dtd "document type definition">
Another way to think about an entity is as a box with a label. The label is the entity's name. The contents of the box can be text or data. If the content of the entity is text, the standard calls this a parsed entity. Because the content is text it may also contain markup. For example:
<!ENTITY line "<P>This is a parsed entity. </P>">
An entity can be fetched from an external source specified by a URI. This is called an external entity. Here's another example:
<!ENTITY intro SYSTEM "http://www.gosynergy.com/intro.xml">
In an XML document, an entity is referred to via an entity reference. Here is the formal definition of an entity reference.
EntityRef ::= '&' Name ';'
Table 4.2 shows the predefined entities for the characters used to delineate markup.
Entity Reference | Character |
---|---|
&amp; | & |
&lt; | < |
&gt; | > |
&apos; | ' |
&quot; | " |
Note
Unlike HTML, which has many predefined entities, the entities listed in Table 4.2 are the only ones predefined in XML.
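As a rough illustration of the five predefined entities at work, the following Python sketch escapes a string and parses it back. The sample string and variable names are hypothetical; escape() comes from the standard xml.sax.saxutils module, which handles &, < and > itself and accepts extra mappings for the two quote characters.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "if a < b && b > c: say 'hi' and \"bye\""

# escape() replaces &, < and > by default; the mapping adds both quote characters.
escaped = escape(raw, {"'": "&apos;", '"': "&quot;"})

# Parsing the escaped text yields the original characters again.
round_tripped = ET.fromstring("<code>" + escaped + "</code>").text
print(escaped)
```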
Here is a more complete example that uses a parsed general entity:
<!DOCTYPE BOOK [
  <!ENTITY publisher "SAMS Professional publishing">
]>
<BOOK>
  <PUBLISHER> &publisher; </PUBLISHER>
  <P> Welcome to this book published by &publisher;. This &publisher; produces numerous professional titles every year. </P>
</BOOK>
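To check how a parser expands an internal general entity, here is a small sketch that feeds a trimmed version of the BOOK example to Python's xml.etree.ElementTree. The trimmed document and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

document = """<!DOCTYPE BOOK [
  <!ENTITY publisher "SAMS Professional publishing">
]>
<BOOK>
  <PUBLISHER>&publisher;</PUBLISHER>
  <P>Welcome to this book published by &publisher;.</P>
</BOOK>"""

book = ET.fromstring(document)

# Each &publisher; reference has already been replaced by the entity's text.
publisher_text = book.find("PUBLISHER").text
paragraph_text = book.find("P").text
print(publisher_text)
print(paragraph_text)
```

Changing the single declaration in the DTD changes every place the entity is referenced, which is the reuse benefit described above.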
There are also unparsed entities for data such as images:
<!ENTITY image SYSTEM "http://www.wyweb.com/myhouse.gif" NDATA GIF>
Notice that unparsed entities are differentiated with the keyword NDATA followed by a notation name for that data. The notation name must be a declared notation. You declare notations in the DTD similar to the way you declare elements. The declaration of the GIF notation would be
<!NOTATION GIF SYSTEM "apps/imgviewer.exe">
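A parser does not fetch or interpret an unparsed entity; it only reports the declaration to the application. The sketch below shows this with Python's xml.parsers.expat module, using its NotationDeclHandler and UnparsedEntityDeclHandler callbacks. The HOUSE document element and the handler and dictionary names are hypothetical.

```python
import xml.parsers.expat

document = """<!DOCTYPE HOUSE [
  <!NOTATION GIF SYSTEM "apps/imgviewer.exe">
  <!ENTITY image SYSTEM "http://www.wyweb.com/myhouse.gif" NDATA GIF>
]>
<HOUSE/>"""

notations = {}
unparsed_entities = {}

def on_notation(name, base, system_id, public_id):
    # Called once for each <!NOTATION ...> declaration.
    notations[name] = system_id

def on_unparsed_entity(name, base, system_id, public_id, notation_name):
    # Called once for each entity declared with NDATA.
    unparsed_entities[name] = (system_id, notation_name)

parser = xml.parsers.expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.UnparsedEntityDeclHandler = on_unparsed_entity
parser.Parse(document, True)

print(notations)
print(unparsed_entities)
```

It is then up to the application to decide what to do with the GIF data, for example by launching the helper named in the notation.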
So far we have seen parsed and unparsed entities and external and internal entities. There is one more distinction that can be applied to entities: general or parameter. So far, we have seen only general entities. Remember, a general entity is an entity used for text replacement in a document instance.
A parameter entity is an entity that is only used in a DTD. It is differentiated in both its declaration and reference by a % symbol.
Here is an example of an internal parameter entity declaration:
<!DOCTYPE EXAMPLE [
  <!ENTITY % obj "<!ELEMENT OBJECT (#PCDATA)>">
  %obj;
]>
Parameter entities have different rules in an internal DTD (called an internal subset) than in an external DTD (called an external subset). In an internal DTD, a parameter entity can only stand for whole declarations (as shown above). In an external DTD, a parameter entity can also stand for part of a declaration. The restriction makes parsing easier for non-validating parsers, which must still parse the internal subset. For example, the following is allowed only in an external DTD:
<!ENTITY % nameAtt "name CDATA #REQUIRED">
<!ATTLIST folder %nameAtt;>
<!ATTLIST bookmark %nameAtt; type CDATA #IMPLIED>
External parameter entities allow you to reuse common declarations. For example, an employee element may be used in several different markup languages across the business. Here is an example of an external parameter entity:
<!ENTITY % employee SYSTEM "http://www.super.com/xml/employee.dtd">
...
%employee;
Note
Markup may not span entity boundaries. The following is illegal:
<!DOCTYPE SAMPLE [
  <!ENTITY start-tag "<title>This is very">
  <!ENTITY end-tag "illegal. </title>">
]>
&start-tag;&end-tag;
I'd like to make the following points about entities:
In an attribute value you can use an internal, general entity. For example:
<!ENTITY favrest "Tippy's Taco house">
<!ATTLIST menu
  date CDATA #REQUIRED
  restaurant CDATA #FIXED "&favrest;">
Entities must be declared before they are used.
Entities are a concept that will take experience to master. A good way to speed up your learning curve on entities is to examine DTDs and documents written by others. See http://xml.org for a repository and catalog of XML documents and DTDs.
In Chapter 2, "Parsing XML," you learned how to declare attributes in a document type declaration using an attribute-list declaration. For example:
<!ELEMENT CONTACT (#PCDATA)>
<!ATTLIST CONTACT EMAIL CDATA #REQUIRED>
Attributes have types that enforce both lexical and semantic constraints. Table 4.3 provides a summary of the attribute types.
Many of the definitions in Table 4.3 specify using either a name or a name token. A name is any valid XML name. An XML name must begin with a letter, an underscore, or a colon, followed by any number of letters, digits, hyphens, underscores, periods, or colons. Colons are now used to denote namespaces (discussed next). XML names are used for all element and attribute names. For example:
<!ELEMENT BODY (#PCDATA)>
An NmToken or name token is any combination of legal name characters for XML names. In other words, all XML names are name tokens, but not all name tokens are XML names. Here are some sample name tokens:
.1.a.name.token.but.not.a.name
234_also_a_name_token_but_not_a_name
A_name_token_and_a_name
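The difference between a name and a name token can be sketched with two regular expressions. This is an ASCII-only approximation that I am assuming for illustration: the real Letter and NameChar rules in the specification cover many more Unicode characters, and the function names below are hypothetical.

```python
import re

# ASCII-only approximation of the specification's name characters.
NAME_CHAR = r"[A-Za-z0-9._:\-]"
NAME = re.compile(r"[A-Za-z_:]" + NAME_CHAR + r"*")   # restricted first character
NMTOKEN = re.compile(NAME_CHAR + r"+")                # any name characters, at least one

def is_name(value):
    return NAME.fullmatch(value) is not None

def is_nmtoken(value):
    return NMTOKEN.fullmatch(value) is not None

samples = [
    ".1.a.name.token.but.not.a.name",
    "234_also_a_name_token_but_not_a_name",
    "A_name_token_and_a_name",
]
for sample in samples:
    print(sample, is_nmtoken(sample), is_name(sample))
```

Only the first character is treated differently: a name token may start with any name character, while a name may not start with a digit, period, or hyphen.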
In Chapter 2, I covered the most common attribute data types (CDATA and enumeration); now I will both define and demonstrate all the available attribute types.
CDATA is the simplest type of attribute. It allows any character data except <, & (unless it starts a reference), or the quotation character used to surround the string. For example:
<!ATTLIST QUOTE DATE CDATA #REQUIRED>
]>
<QUOTE DATE="February 9, 1999">... </QUOTE>
An enumeration type allows an attribute to take one name token from a list of any number of name tokens. For example (QUESTION is an illustrative element name; CHOICE is the attribute):
<!ATTLIST QUESTION CHOICE (option1|option2|option3) #REQUIRED>
Name token (NMTOKEN) attributes are similar to CDATA except that they are restricted to valid name tokens (only name characters). An empty string is not a valid name token. Also, a name token cannot have whitespace. For example:
<!ATTLIST QUOTE DATE NMTOKEN #REQUIRED>
...
]>
<QUOTE DATE="1999-02-09">... </QUOTE>
The NMTOKENS type allows an attribute value to be one or more name tokens separated by spaces.
An ID attribute allows you to name a particular element so that it may be referred to later using an IDREF attribute. ID attributes will also be used with XLinks, which are discussed later. ID values are XML names. Every element can have at most one ID attribute. All ID values specified in an XML document must be unique. IDREF attributes must refer to an ID in the document. If you use the IDREFS type instead, the attribute may have one or more IDREFs, separated by spaces, as its value. For example:
<!DOCTYPE PAPER [
  <!ELEMENT SECTION (TITLE, PARAGRAPH*)>
  <!ATTLIST SECTION SEC-ID ID #IMPLIED>
  <!ELEMENT CROSS-REFERENCE EMPTY>
  <!ATTLIST CROSS-REFERENCE TARGET IDREF #REQUIRED>
  ...
]>
<PAPER>
  <SECTION SEC-ID="java.features"><TITLE> Java's Best features </TITLE> ... </SECTION>
  ...
  To refresh your memory, see the section titled <CROSS-REFERENCE TARGET="java.features" />
</PAPER>
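A validating parser enforces the ID and IDREF rules for you. To make the two rules concrete, here is a hand-rolled sketch in Python that checks them on a cut-down PAPER document. It assumes the caller already knows which attributes are of type ID and IDREF (ElementTree does not expose attribute types), and the second cross-reference with the target missing.section is a hypothetical addition used to trigger an error.

```python
import xml.etree.ElementTree as ET

document = """<PAPER>
  <SECTION SEC-ID="java.features"><TITLE>Java's Best features</TITLE></SECTION>
  <CROSS-REFERENCE TARGET="java.features"/>
  <CROSS-REFERENCE TARGET="missing.section"/>
</PAPER>"""

def check_ids(root, id_attr, idref_attr):
    """Return (duplicate ID values, IDREF values that match no ID)."""
    seen = set()
    duplicates = []
    for element in root.iter():
        value = element.get(id_attr)
        if value is not None:
            if value in seen:
                duplicates.append(value)   # rule: every ID must be unique
            seen.add(value)
    dangling = []
    for element in root.iter():
        target = element.get(idref_attr)
        if target is not None and target not in seen:
            dangling.append(target)        # rule: every IDREF must match an ID
    return duplicates, dangling

duplicates, dangling = check_ids(ET.fromstring(document), "SEC-ID", "TARGET")
print(duplicates, dangling)
```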
An ENTITY attribute is used to refer to an unparsed external entity.
<!DOCTYPE BOOKREVIEW [
  ...
  <!ATTLIST BOOK COVER ENTITY #REQUIRED>
  <!NOTATION GIF SYSTEM "apps/gifview.exe">
  <!ENTITY java-book1 SYSTEM "http://www.sellbooks.com/java/book1.gif" NDATA GIF>
]>
<BOOKREVIEW>
  <BOOK COVER="java-book1">... </BOOK>
</BOOKREVIEW>
You may also declare an attribute to refer to one or more entities using the ENTITIES designation.
A NOTATION attribute type is used to specify that an attribute value is one of several declared notations. For example:
<!ATTLIST COVER_IMG type NOTATION (GIF|JPEG|BMP) "GIF">
After you declare your attributes and assign them to an appropriate type, you can use attributes in your document. The values assigned to those attributes are modified by a process called "normalization," which is discussed next.
Normalization and whitespace handling are detailed processes for handling specific text processing situations. This type of fine granularity is the basis of a good standard.
Element attributes are name="value" pairs; however, the value between the quotes is first passed through a process called normalization. Here are the steps in the normalization process:
Surrounding quotes are stripped out.
Character references are replaced with their corresponding characters. For example, &#169; would be replaced with a copyright symbol.
General entity references are replaced with their corresponding text. This is a recursive process, which means that if the replacement text also contains references, they are replaced, and so on.
Whitespace characters (carriage return, line feed, tab and space) in attribute values are replaced by spaces. Also, the sequence CR-LF is replaced by a single space.
If an attribute type is anything other than CDATA, leading and trailing spaces are removed. Also, if using tokenized types, spaces between tokens are collapsed to a single space.
It is important to remember the distinction between unnormalized attribute value text and attribute value data (after normalization). For example:
<GRAPHIC ALTERNATE-TEXT="This is a picture
of a penguin
dancing.">
The attribute value is normalized to: This is a picture of a penguin dancing.
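The normalization steps can be observed with any XML parser. The following sketch uses Python's xml.etree.ElementTree; the exact mix of line breaks, the tab, and the added character reference in the attribute value are my own assumptions, chosen to exercise several steps at once.

```python
import xml.etree.ElementTree as ET

# The literal contains a line feed, a CR-LF pair followed by a tab,
# and a decimal character reference.
source = (
    '<GRAPHIC ALTERNATE-TEXT="This is a picture\n'
    'of a penguin\r\n'
    '\tdancing &#169;."/>'
)

value = ET.fromstring(source).get("ALTERNATE-TEXT")
print(repr(value))
```

Because no DTD declares the attribute, it is treated as CDATA: the line feed, the CR-LF pair, and the tab each become one space, so two spaces remain before "dancing". They would be collapsed to one only for a tokenized type such as NMTOKENS.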
You may remember that in Chapter 2 we contrasted XML to HTML in its treatment of whitespace. Whereas HTML disregarded whitespace, XML preserved whitespace in your document content. To be technically accurate, the specification requires that an XML processor (usually a parser) pass whitespace on to the application (the consumer program of the data). The application then can determine whether whitespace is significant. The specification provides a special attribute called xml:space that can be attached to any element in order to specify the proper treatment of whitespace to the application. Here is the form of the xml:space attribute:
<!ATTLIST elemName xml:space (default | preserve) 'preserve'>
The elemName in the general form is any element you want to attach the attribute to. By convention, the attribute applies to that element and its child elements. The value 'preserve' specifies that the application should preserve all whitespace. The value 'default' indicates that the application's default processing for whitespace is acceptable (whatever that may be).
Another aspect of handling whitespace across heterogeneous platforms is the processing of end-of-line characters. The problem is that there are three widespread methods for handling end-of-line: Mac OS uses a carriage return (CR), UNIX uses a line feed (LF) and Windows uses a carriage return line feed (CR-LF) sequence. The XML specification requires that the XML processor convert any of the stated conventions to a single LF to signify end-of-line.
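Here is a brief sketch of the end-of-line rule, again using Python's xml.etree.ElementTree. The element name T and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The same two-line text written with each platform's line ending.
mac_text = ET.fromstring("<T>one\rtwo</T>").text        # carriage return
unix_text = ET.fromstring("<T>one\ntwo</T>").text       # line feed
windows_text = ET.fromstring("<T>one\r\ntwo</T>").text  # carriage return + line feed

print(repr(mac_text), repr(unix_text), repr(windows_text))
```

All three documents deliver the same text to the application, with a single line feed between the two words.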
As previously stated, element type declarations start with the literal string <!ELEMENT followed by an element name and then a content specification:
<!ELEMENT html (head, body) >
The content specification can be one of four types: EMPTY, ANY, mixed content, or element content. The element content model is the most common. The EMPTY content model is for empty elements.
The ANY content model allows an element to contain any character data or child elements. This is a completely unstructured content specification and therefore is rarely used.
A mixed content element may contain character data, optionally interspersed with child elements. Here is the grammar for the mixed content specification:
Mixed ::= '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*' | '(' S? '#PCDATA' S? ')'
This grammar states that you can either have the literal #PCDATA followed by zero or more child element names or just have the literal #PCDATA by itself. PCDATA stands for parsed character data.
Example 1: <!ELEMENT NAME (#PCDATA) >
Example 2: <!ELEMENT paragraph (#PCDATA|quote|reference)* >
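To show what mixed content looks like to an application, this sketch parses a paragraph element that follows the second declaration and walks its pieces in document order. The sample sentence is hypothetical. In Python's ElementTree, character data before the first child is held in text, and character data after each child is held in that child's tail.

```python
import xml.etree.ElementTree as ET

paragraph = ET.fromstring(
    "<paragraph>She said <quote>hello</quote> and cited "
    "<reference>Smith</reference>.</paragraph>"
)

# Rebuild the interleaving of character data and child elements.
parts = [("text", paragraph.text)]
for child in paragraph:
    parts.append((child.tag, child.text))
    parts.append(("text", child.tail))

for kind, content in parts:
    print(kind, repr(content))
```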
A CDATA section is used in a document when you do not want the content to be treated as markup. The most obvious example of using a CDATA section would be to pass XML markup into an application (instead of having it parsed as markup). A CDATA section starts with the string <![CDATA[ and ends with the string ]]>. For example:
<![CDATA[<TITLE> an XML example </TITLE>]]>
Another possible use of a CDATA section would be to pass source code to an application without having to use character references for reserved characters like the < or > symbol.
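The effect of a CDATA section is easy to confirm: the parser hands the contents over as plain character data rather than as elements. A minimal Python sketch follows, with EXAMPLE as a hypothetical wrapper element.

```python
import xml.etree.ElementTree as ET

document = "<EXAMPLE><![CDATA[<TITLE> an XML example </TITLE>]]></EXAMPLE>"
example = ET.fromstring(document)

child_count = len(example)   # number of child elements the parser created
content = example.text       # character data delivered to the application
print(child_count, repr(content))
```

No TITLE element is created; the angle brackets arrive as ordinary characters.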
Conditional sections can occur only in the external subset of the document type declaration and in external parameter entities referenced from the internal subset. A conditional section allows you to turn on and off a series of markup declarations. There are two keywords used with conditional sections: INCLUDE and IGNORE.
A conditional section may include one or more complete declarations, comments, processing instructions, or nested conditional sections. If the keyword used is INCLUDE, the section is processed. If the keyword is IGNORE, the section is not processed.
Here is an example of using conditional sections:
<![INCLUDE[
  <!ELEMENT article (title, section+, references*)>
]]>
<![IGNORE[
  <!ELEMENT article (title, section+)>
]]>
This is very useful for turning on and off parts of a DTD during development. You can use a parameter entity reference for the keyword of a conditional section. The processor will replace the reference before determining whether it should include or ignore the section. For example, the previous example could be rewritten like this:
<!ENTITY % editor "INCLUDE">
<!ENTITY % author "IGNORE">
<![%editor;[
  <!ELEMENT article (title, section+, references*)>
]]>
<![%author;[
  <!ELEMENT article (title, section+)>
]]>
A processing instruction is used to pass additional information to one specific processing application without changing the way the document is processed by other applications. In general, processing instructions should be used infrequently. The format of a processing instruction is the literal <? followed by a name (the name of the target application), followed by any text and ending with the literal string ?>.
Note
The name of the application in a processing instruction may not be any variation of the letters XML.
Here is an example that changes the font of the first word of a paragraph. You may have to do this if you are using someone else's DTD that does not have markup for something you want to do:
<SECTION>
  The man stood on the beach.
  <p> <?EZFormat Font="24Pt"?> Hey! <?EZFormat endFont ?> </p>
</SECTION>
Another reason for processing instructions could be for sending special commands to a CGI program processing the XML prior to passing it to a client.
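An application that understands a particular target picks its processing instructions out of the parsed document and ignores the rest. Here is a sketch using Python's xml.dom.minidom, assuming a well-formed version of the EZFormat example; the variable names are hypothetical.

```python
from xml.dom import minidom, Node

document = minidom.parseString(
    "<SECTION>The man stood on the beach."
    '<p><?EZFormat Font="24Pt"?>Hey!<?EZFormat endFont ?></p>'
    "</SECTION>"
)

paragraph = document.getElementsByTagName("p")[0]

# Each processing instruction node exposes its target name and its data text.
instructions = [
    (node.target, node.data.strip())
    for node in paragraph.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
```

An application that does not recognize the EZFormat target can skip these nodes without affecting the rest of the document.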
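XML uses a special processing instruction for attaching XSL stylesheets to a document instance. The W3C Recommendation for associating stylesheets with XML documents names this instruction xml-stylesheet:
<?xml-stylesheet href="http://www.mystuff.com/memo.xsl" type="text/xsl" ?>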
Last, remember that the XML declaration is a form of processing instruction.
In the primer on XML, I discussed the XML declaration and stated that it contained some literal text <?xml, followed by version information, an optional encoding declaration, an optional standalone document declaration, and the literal text ?>. I will now examine the two optional parts of the XML declaration: the encoding declaration and the standalone document declaration.
The encoding declaration specifies the character set encoding for the following document. The earlier section "Character References" contains a note that defines character set encoding and how it relates to both characters and character sets. The specification requires all XML processors to support both UTF-8 and UTF-16 encoding. Support of all other encodings is optional. In the absence of an encoding declaration or a byte order mark (this allows auto-detection of a UTF-16 encoded file), the encoding must be UTF-8. Because ASCII is a subset of UTF-8, ordinary ASCII files do not need an encoding declaration. Here are some examples of encoding declarations:
<?xml version='1.0' encoding='UTF-16'?>
<?xml version='1.0' encoding='ISO-10646-UCS-2'?>
A list of Internet-supported character set names can be retrieved from
ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
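To see the encoding declaration at work, the sketch below encodes the same small document three ways and parses each byte string with Python's xml.etree.ElementTree. The WORD element, the sample text, and the use of ISO-8859-1 as the optional encoding are assumptions for illustration. Python's utf-16 codec writes a byte order mark, which is what allows the auto-detection mentioned above.

```python
import xml.etree.ElementTree as ET

text = "caf\u00e9"   # contains one non-ASCII character

utf16_bytes = (
    '<?xml version="1.0" encoding="UTF-16"?><WORD>' + text + "</WORD>"
).encode("utf-16")
latin1_bytes = (
    "<?xml version='1.0' encoding='ISO-8859-1'?><WORD>" + text + "</WORD>"
).encode("iso-8859-1")
# No declaration and no byte order mark: the parser must assume UTF-8.
utf8_bytes = ("<WORD>" + text + "</WORD>").encode("utf-8")

decoded = [ET.fromstring(data).text for data in (utf16_bytes, latin1_bytes, utf8_bytes)]
print([len(data) for data in (utf16_bytes, latin1_bytes, utf8_bytes)])
```

The three byte strings differ in length and content, yet each parses to the same four-character text.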
Note
Any external parsed entity may begin with a text declaration. A text declaration resembles an XML declaration, except that the version information is optional and there is no standalone document declaration. For example:
<?xml encoding='UTF-16'?>
The standalone document declaration is rarely used and is not recommended for general use. As we stated previously, a DTD can be composed of both an external subset and an internal subset. An external subset is stored elsewhere and referenced via a URI. A standalone document declaration declares whether an application needs to fetch the external subset of the DTD to process the document correctly. For example:
<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE HTML SYSTEM "http://www.xmlstuff.com/html.dtd">
<HTML> ... </HTML>
This example would state to a processor that the client does not need to fetch the DTD to properly process the document. It is important to note that a document is not valid unless both the external subset and internal subset of a DTD have been processed.
Another scenario for using the standalone document declaration is if multiple programs process a document but only the first one validates the document. All ensuing programs could safely skip that step.
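A program further down such a pipeline can read the standalone flag before deciding whether to fetch anything. Here is a sketch using the XmlDeclHandler callback of Python's xml.parsers.expat, which reports 1 for standalone="yes", 0 for standalone="no", and -1 when the clause is omitted. The helper name read_standalone is hypothetical.

```python
import xml.parsers.expat

def read_standalone(document):
    """Return 1, 0 or -1 as reported by the parser, or None without an XML declaration."""
    seen = {}

    def on_xml_declaration(version, encoding, standalone):
        seen["standalone"] = standalone

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_declaration
    parser.Parse(document, True)
    return seen.get("standalone")

declared_yes = read_standalone('<?xml version="1.0" standalone="yes"?><HTML/>')
declared_no = read_standalone('<?xml version="1.0" standalone="no"?><HTML/>')
omitted = read_standalone('<?xml version="1.0"?><HTML/>')
print(declared_yes, declared_no, omitted)
```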
The XML grammar is specified using an Extended Backus-Naur form (EBNF) notation. Using an EBNF defines a context-free grammar—a grammar that is independent of the context in which it is used. The notation for definitions in the grammar is
symbol ::= expression
where expression defines the rule for creating the symbol on the left-hand side. This formal grammar ensures that there is no ambiguity in the XML syntax. All the legal expressions are precisely defined via EBNF.
Note
It is important to keep in mind that this section refers to EBNF syntax and not XML syntax. The purpose for reviewing EBNF is to give you the ability to consult the XML specification when necessary.
EBNF statements are also called production rules, because they express the way in which valid symbols are constructed or produced using other symbols or specific fixed strings.
Table 4.4 shows EBNF notations in the grammar and their meaning.
Here is a snippet of the XML grammar in the XML specification:
elementdecl ::= '<!ELEMENT' S Name S contentspec S? '>'
S ::= (#x20 | #x9 | #xD | #xA)+
Name ::= (Letter | '_' | ':') (NameChar)*
contentspec ::= 'EMPTY' | 'ANY' | Mixed | children
children ::= (choice | seq) ('?' | '*' | '+')?
cp ::= (Name | choice | seq) ('?' | '*' | '+')?
choice ::= '(' S? cp ( S? '|' S? cp )* S? ')'
seq ::= '(' S? cp ( S? ',' S? cp )* S? ')'
Mixed ::= '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*' | '(' S? '#PCDATA' S? ')'
Note
See Letter and NameChar rules in the XML Recommendation. They occupy several pages and were left out for brevity.
Here is a partial translation of the production rules:
An element declaration is the literal <!ELEMENT, followed by whitespace, a legal XML name, whitespace, a symbol called contentspec, (optionally) whitespace, and finally the literal >.
A contentspec is either the literal EMPTY or ANY, or the translation of the symbol Mixed or the symbol children.
A children symbol is translated as either a choice or a sequence followed optionally by an occurrence indicator, which is the literal ?, *, or +.
A choice symbol is translated as the literal (, optional whitespace, a symbol called cp (a content particle), zero or more occurrences of the literal | followed by another content particle (with optional whitespace around each), and the literal ).
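One way to get comfortable reading production rules is to translate a simple one into a regular expression. The sketch below does this for the Mixed production, which is possible because Mixed does not nest. It assumes an ASCII-only approximation of Name, and the function name is_mixed is hypothetical. Recursive productions such as choice and seq cannot be expressed this way and need a real parser.

```python
import re

S = r"[ \t\r\n]"                        # one whitespace character, as in the S rule
NAME = r"[A-Za-z_:][A-Za-z0-9._:\-]*"   # ASCII-only approximation of Name

# Mixed ::= '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*' | '(' S? '#PCDATA' S? ')'
MIXED = re.compile(
    rf"\({S}*#PCDATA(?:{S}*\|{S}*{NAME})*{S}*\)\*"
    rf"|\({S}*#PCDATA{S}*\)"
)

def is_mixed(content_spec):
    return MIXED.fullmatch(content_spec) is not None

print(is_mixed("(#PCDATA)"))
print(is_mixed("(#PCDATA|quote|reference)*"))
print(is_mixed("(#PCDATA|quote)"))
```

The last call reports False because the grammar requires the trailing * whenever child element names are listed.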