Advanced Markup

In this section, we will cover advanced topics of XML markup such as entities, the ANY and mixed content specifications, and the remaining attribute types. We begin with how and why to add character references to your XML document.

Character References

XML uses the Unicode character set. Unicode is a standard that covers characters from virtually all existing, and even many ancient, languages. If your keyboard does not have a key for a particular international character, you can insert that character into an XML document using a character reference.

Character Representation

It is important to distinguish between the varying facets of representing characters in documents. Each definition represents one facet of such representation:

  • Character— A letter in a language.

  • Glyph— The picture or rendered illustration of the character.

  • Coded character set— An agreed-on mapping of characters to positions in a code space. XML uses the Unicode character set. Unicode supports a base of 1,114,112 positions and each character can be encoded in one or two 16-bit words.

  • Font— A collection of glyphs for a character set.

  • Character set encoding— The final step in using characters in an electronic document is deciding how to represent the numeric positions of the chosen character set in binary form in the file on disk. Simple character sets like ASCII or Latin1 use one byte of storage per character, with the value of the byte being the integer position in the character set. This works only because these character sets support just 128 and 256 characters, respectively. Because Unicode is a much larger character set, some other encoding is needed. Four encodings of Unicode characters are covered here: UCS-2, UTF-7, UTF-8, and UTF-16. The last two are the most common. UCS-2 encodes only the first 65,536 positions. UTF-7 uses only the first seven bits in a byte and was suitable for older email handling agents. UTF stands for Universal Character Set Transformation Format. UTF-8 uses one or more eight-bit bytes to encode Unicode characters, with the first 128 characters encoded in a single byte just like ASCII; Latin1 characters above position 127 take two bytes. UTF-16 is similar to UCS-2 but has an escape mechanism (surrogate pairs) to encode all the Unicode characters.
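To make the difference between a character's position and its encoded bytes concrete, here is a small sketch in Python. Python itself and the helper name encoded_length are assumptions chosen for illustration; they are not part of the XML specification.

```python
# Illustrative sketch: the same character occupies a different number of
# bytes depending on the encoding chosen for the file.
def encoded_length(char: str, encoding: str) -> int:
    """Return how many bytes are needed to store `char` in `encoding`."""
    return len(char.encode(encoding))

# A, copyright symbol, euro sign, and a character beyond the first 65,536 positions
samples = ["A", "\u00a9", "\u20ac", "\U0001d11e"]

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        "utf-8:", encoded_length(ch, "utf-8"),
        "utf-16:", encoded_length(ch, "utf-16-be"),  # big-endian, no byte order mark
    )
```

The last sample needs two 16-bit words in UTF-16, which is the escape mechanism that UCS-2 lacks.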


There are two formats for a character reference: decimal and hex. The decimal representation is

CharRef ::= '&#' [0-9]+ ';'

For example:

<P> Here is a special character: &#169; </P>

In Unicode, as in Latin1, position 169 is the copyright symbol (©).

The hex representation is

CharRef ::= '&#x' [0-9a-fA-F]+ ';'

Here is an example of a character reference in hex:

<P> Here is another special character: &#xB6; </P>

In decimal, this character is position 182, which is a paragraph symbol.

Note

The Unicode standard lists its characters in hex notation.
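If you want to confirm how a parser treats the two reference formats, a quick check along the following lines may help. It is only a sketch that uses Python's standard xml.etree.ElementTree module, which is an assumption of this example rather than a tool used elsewhere in this chapter.

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal references to the same position yield the same character.
decimal = ET.fromstring("<P>&#169;</P>").text
hexadecimal = ET.fromstring("<P>&#xA9;</P>").text
paragraph_sign = ET.fromstring("<P>&#xB6;</P>").text

print(decimal == hexadecimal)   # both are the copyright symbol
print(ord(paragraph_sign))      # decimal position of &#xB6;
```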


Just as character references are abbreviations for Unicode characters, the XML specification also allows you to define your own abbreviations called entities and refer to them with entity references.

Entities and Entity References

Entities and entity references are techniques for enabling reuse of both content and markup in your XML documents. Entities can be confusing because there are many different categories of entities; however, after you clearly understand the various categories and to what category a particular entity belongs, they make sense. Figure 4.1 depicts the hierarchy of entity categories.

Figure 4.1. Hierarchy of entity categories.


Before demonstrating each category, let's look at Table 4.1, which provides definitions of each category.

Table 4.1. Entity Categories
Category                  Definition
General entity            An entity used only in document content.
Parameter entity          An entity used only in document type definitions (DTDs).
Parsed entity             An entity whose replacement content is text.
Unparsed entity           Normally used for binary (non-text) data. If it is text, it may not be XML. Has an associated notation.
Internal parsed entity    An entity whose replacement text is given directly in its declaration.
External entity           An entity fetched from an external source. The external source is specified via a URI.

Note

An internal entity must be a parsed entity.


General Entities

In its simplest form (an internal, general parsed entity) an entity is just an abbreviation for larger text. For example, an entity dtd could be used to abbreviate the phrase document type definition.

You declare entities in your document type definition like this:

<!ENTITY dtd  "document type definition">

Another way to think about an entity is as a box with a label. The label is the entity's name. The contents of the box can be text or data. If the content of the entity is text, the standard calls this a parsed entity. Because the content is text it may also contain markup. For example:

<!ENTITY line "<P>This is a parsed entity. </P>">

An entity can be fetched from an external source specified by a URI. This is called an external entity. Here's another example:

<!ENTITY intro SYSTEM "http://www.gosynergy.com/intro.xml">

In an XML document, an entity is referred to via an entity reference. Here is the formal definition of an entity reference.

EntityRef ::= '&' Name ';'

Table 4.2 shows the predefined entities for the characters used to delineate markup.

Table 4.2. Predefined Entities
Entity Reference    Character
&amp;               &
&lt;                <
&gt;                >
&apos;              '
&quot;              "

Note

Unlike HTML, which has many predefined entities, the entities listed in Table 4.2 are the only ones predefined in XML.
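When a program generates XML, the predefined entities are usually applied by an escaping routine rather than typed by hand. The following sketch uses the escape, unescape, and quoteattr helpers from Python's standard xml.sax.saxutils module; the sample string is made up for illustration.

```python
from xml.sax.saxutils import escape, unescape, quoteattr

raw = "a < b & c > d"
escaped = escape(raw)            # replaces &, < and > with entity references
print(escaped)
print(unescape(escaped) == raw)  # the round trip restores the original text

# quoteattr picks a quote character that does not clash with the value.
print(quoteattr("Tippy's Taco house"))
```

By default escape handles only &, <, and >. The remaining two predefined entities matter mainly inside attribute values, which is what quoteattr is for.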


Here is a more complete example that uses a parsed general entity:

<!DOCTYPE BOOK [
<!ENTITY publisher "SAMS Professional publishing">
]>
<BOOK>
<PUBLISHER> &publisher; </PUBLISHER>
<P> Welcome to this book published by &publisher;.  This
&publisher; produces numerous professional titles every year. </P>
</BOOK>
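To see the replacement happen, you can hand a document like this one to a parser and inspect the result. The sketch below assumes Python's standard xml.etree.ElementTree module and a shortened version of the document; an application never sees the entity reference itself, only the replacement text.

```python
import xml.etree.ElementTree as ET

document = """<!DOCTYPE BOOK [
<!ENTITY publisher "SAMS Professional publishing">
]>
<BOOK>
<PUBLISHER>&publisher;</PUBLISHER>
<P>Welcome to this book published by &publisher;.</P>
</BOOK>"""

book = ET.fromstring(document)
publisher_text = book.find("PUBLISHER").text
paragraph_text = book.find("P").text
print(publisher_text)
print(paragraph_text)
```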

There are also unparsed entities for data such as images:

<!ENTITY image SYSTEM "http://www.wyweb.com/myhouse.gif" NDATA GIF>

Notice that unparsed entities are differentiated with the keyword NDATA followed by a notation name for that data. The notation name must be a declared notation. You declare notations in the DTD similar to the way you declare elements. The declaration of the GIF notation would be

<!NOTATION GIF SYSTEM "apps/imgviewer.exe">

So far we have seen parsed and unparsed entities and external and internal entities. There is one more distinction that can be applied to entities: general or parameter. So far, we have seen only general entities. Remember, a general entity is an entity used for text replacement in a document instance.

Parameter Entities

A parameter entity is an entity that is only used in a DTD. It is differentiated in both its declaration and reference by a % symbol.

Here is an example of an internal parameter entity declaration:

<!DOCTYPE EXAMPLE [
<!ENTITY % obj "<!ELEMENT OBJECT (#PCDATA)>">
%obj;
]>

Parameter entities have different rules for an internal DTD (called an internal subset) versus an external DTD (called an external subset). In an internal DTD, a parameter entity reference may stand only where a whole declaration could appear (as shown above). In an external DTD, a parameter entity may also supply part of a declaration. This restriction makes parsing easier for non-validating parsers, which must parse the internal subset. For example, the following is allowed only in an external DTD:

<!ENTITY % nameAtt  "name CDATA #REQUIRED">
<!ATTLIST folder
         %nameAtt;>
<!ATTLIST bookmark
         %nameAtt;
         type CDATA #IMPLIED>

External parameter entities allow you to reuse common declarations. For example, an employee element may be used in several different markup languages across the business. Here is an example of an external parameter entity:

<!ENTITY % employee SYSTEM "http://www.super.com/xml/employee.dtd">
...
%employee;

Note

Markup may not span entity boundaries. The following is illegal:

<!DOCTYPE SAMPLE [
<!ENTITY start-tag  "<title>This is very">
<!ENTITY end-tag  "illegal. </title>">
]>
&start-tag;&end-tag;
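A parser reports this kind of error as soon as it expands the first entity. The following sketch assumes Python's standard xml.etree.ElementTree module and wraps the two references in a SAMPLE root element, which is added here only to make the test document complete.

```python
import xml.etree.ElementTree as ET

bad_document = """<!DOCTYPE SAMPLE [
<!ENTITY start-tag "<title>This is very">
<!ENTITY end-tag "illegal. </title>">
]>
<SAMPLE>&start-tag;&end-tag;</SAMPLE>"""

try:
    ET.fromstring(bad_document)
    rejected = False
except ET.ParseError as error:
    # The title element opened inside the first entity is never closed there.
    rejected = True
    print("rejected:", error)
```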


I'd like to make the following points about entities:

  • In an attribute value you can use an internal, general entity. For example:

    <!ENTITY favrest "Tippy's Taco house">
    <!ATTLIST menu
              date CDATA #REQUIRED
              restaurant CDATA #FIXED "&favrest;">
    
  • Entities must be declared before they are used.

Entities are a concept that will take experience to master. A good way to speed up your learning curve on entities is to examine DTDs and documents written by others. See http://xml.org for a repository and catalog of XML documents and DTDs.

Understanding Attribute Types

In Chapter 2, "Parsing XML," you learned how to declare attributes in a document type declaration using an attribute-list declaration. For example:

<!ELEMENT CONTACT (#PCDATA)>
<!ATTLIST  CONTACT  EMAIL CDATA #REQUIRED>

Attributes have types that enforce both lexical and semantic constraints. Table 4.3 provides a summary of the attribute types.

Table 4.3. Attribute Types
Type          Definition
CDATA         Any character data
Enumeration   A list of Nmtokens ("name tokens") where only one may be used (similar to a choice content model)
NOTATION      One name chosen from a list of declared notation names
ID            A name that uniquely identifies an element
IDREF         A reference to an element (by its ID)
IDREFS        May refer to one or more IDs (space delimited)
ENTITY        A name of a declared unparsed entity
ENTITIES      One or more declared unparsed entities (space delimited)
NMTOKEN       An Nmtoken (see definition later in the chapter)
NMTOKENS      One or more Nmtokens (space delimited)

Many of the definitions in Table 4.3 specify using either a name or a name token. A name is any valid XML name. An XML name must begin with a letter, an underscore, or a colon, followed by any number of letters, digits, hyphens, underscores, periods, or colons. Colons are now used to denote namespaces (discussed next). XML names are used for all element and attribute names. For example:

<!ELEMENT BODY (#PCDATA)>

An Nmtoken, or name token, is any combination of legal name characters for XML names. In other words, all XML names are name tokens, but not all name tokens are XML names. Here are some sample name tokens:

.1.a.name.token.but.not.a.name
234_also_a_name_token_but_not_a_name
A_name_token_and_a_name
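The difference between the two rules is easy to express as a pair of patterns. The sketch below is an ASCII-only approximation written in Python: the real Letter and NameChar rules cover far more of Unicode, and the names is_name and is_nmtoken are hypothetical helpers, not part of any XML library.

```python
import re

# ASCII-only approximation of the name rules; the full rules in the XML
# Recommendation admit many more Unicode letters and digits.
NAME_START = r"[A-Za-z_:]"
NAME_CHAR = r"[A-Za-z0-9._:\-]"
NAME_RE = re.compile(rf"{NAME_START}{NAME_CHAR}*")
NMTOKEN_RE = re.compile(rf"{NAME_CHAR}+")

def is_name(text: str) -> bool:
    """True if `text` starts with a name-start character and contains only name characters."""
    return NAME_RE.fullmatch(text) is not None

def is_nmtoken(text: str) -> bool:
    """True if `text` is a non-empty run of name characters."""
    return NMTOKEN_RE.fullmatch(text) is not None

for sample in [".1.a.name.token.but.not.a.name",
               "234_also_a_name_token_but_not_a_name",
               "A_name_token_and_a_name"]:
    print(sample, "nmtoken:", is_nmtoken(sample), "name:", is_name(sample))
```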

In Chapter 2, I covered the most common attribute data types (CDATA and enumeration); now I will both define and demonstrate all the available attribute types.

  • CDATA is the simplest type of attribute. It allows any character data except <, & (unless it starts a reference), or the quotation character used to surround the string. For example:

    <!ATTLIST QUOTE  DATE  CDATA #REQUIRED>
    ... ]>
    <QUOTE  DATE="February 9, 1999">... </QUOTE>
    
  • An Enumeration type allows an attribute to take one name token among a choice of any number of name tokens. For example:

    <!ATTLIST elemName CHOICE (option1|option2|option3) #REQUIRED>
    
  • Name token (NMTOKEN) attributes are similar to CDATA except that they are restricted to valid name tokens (only name characters). An empty string is not a valid name token. Also, a name token cannot have whitespace. For example:

    <!ATTLIST QUOTE DATE NMTOKEN #REQUIRED>
    ... ]>
    <QUOTE  DATE="1999-02-09">... </QUOTE>
    

    The NMTOKENS declaration allows an attribute value to be one or more name tokens separated by spaces.

  • An ID attribute allows you to name a particular element so that it may be referred to later using an IDREF attribute. ID attributes are also used with XLinks, which are discussed later. IDs are XML names. Every element type can have at most one ID attribute. All ID values specified in an XML document must be unique. An IDREF attribute must refer to an ID in the document. If you use the IDREFS designation, the attribute may have one or more space-delimited IDREFs as its value. For example:

       <!DOCTYPE PAPER [
       <!ELEMENT SECTION (TITLE, PARAGRAPH*)>
       <!ATTLIST  SECTION SEC-ID  ID  #IMPLIED>
       <!ELEMENT  CROSS-REFERENCE  EMPTY>
       <!ATTLIST  CROSS-REFERENCE  TARGET  IDREF  #REQUIRED>
       ... ]>
       <PAPER>
       <SECTION  SEC-ID="java.features"><TITLE> Java's Best
       features </TITLE>  ...  </SECTION>
       ...
       To refresh your memory, see the section titled
       <CROSS-REFERENCE TARGET="java.features"/> </PAPER>
    
  • An ENTITY attribute is used to refer to an unparsed external entity.

       <!DOCTYPE BOOKREVIEW [
       ...
       <!ATTLIST BOOK COVER ENTITY #REQUIRED>
       <!NOTATION GIF SYSTEM "apps/gifview.exe">
       <!ENTITY  java-book1  SYSTEM
          "http://www.sellbooks.com/java/book1.gif" NDATA GIF>
       ]>
       <BOOKREVIEW>
       <BOOK  COVER="java-book1">... </BOOK>
       </BOOKREVIEW>
    

    You may also declare an attribute to refer to one or more entities using the ENTITIES designation.

  • A NOTATION attribute type is used to specify that an attribute value is one of several declared NOTATIONS. For example:

       <!ATTLIST COVER_IMG
                 type NOTATION (GIF|JPEG|BMP) "GIF">
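The ID and IDREF rules in the list above are enforced only by a validating parser. If your parser does not validate, a check in the same spirit can be written by hand. The following is a sketch in Python using the standard xml.etree.ElementTree module; the function name dangling_idrefs is hypothetical, and because no DTD is read, the attribute names must be passed in explicitly.

```python
import xml.etree.ElementTree as ET

def dangling_idrefs(root, id_attr, idref_attr):
    """Return the IDREF values that do not match any ID value in the tree."""
    ids = {el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None}
    refs = [el.get(idref_attr) for el in root.iter() if el.get(idref_attr) is not None]
    return [ref for ref in refs if ref not in ids]

paper = ET.fromstring(
    "<PAPER>"
    '<SECTION SEC-ID="java.features"><TITLE>Java\'s Best features</TITLE></SECTION>'
    '<CROSS-REFERENCE TARGET="java.features"/>'
    '<CROSS-REFERENCE TARGET="no.such.section"/>'
    "</PAPER>"
)
print(dangling_idrefs(paper, "SEC-ID", "TARGET"))
```

The second cross-reference is deliberately broken so that the check has something to report.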
    

After you declare your attributes and assign them to an appropriate type, you can use attributes in your document. The values assigned to those attributes are modified by a process called "normalization," which is discussed next.

Attribute Value Normalization and Whitespace Handling

Normalization and whitespace handling are detailed processes for handling specific text processing situations. This type of fine granularity is the basis of a good standard.

Attribute Value Normalization

Element attributes are name="value" pairs; however, the value between the quotes is first passed through a process called normalization. Here are the steps in the normalization process:

  • Surrounding quotes are stripped out.

  • Character references are replaced with their corresponding characters. For example, &#169; would be replaced with a copyright symbol.

  • General entity references are replaced with their corresponding text. This is a recursive process, which means that if the replacement text also contains references, they are replaced, and so on.

  • Whitespace characters (carriage return, line feed, tab and space) in attribute values are replaced by spaces. Also, the sequence CR-LF is replaced by a single space.

  • If an attribute type is anything other than CDATA, leading and trailing spaces are removed. Also, if using tokenized types, spaces between tokens are collapsed to a single space.

It is important to remember the distinction between unnormalized attribute value text and attribute value data (after normalization). For example:

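<GRAPHIC  ALTERNATE-TEXT="This is a picture of
                             a penguin dancing.">

The line break is replaced by a space in every case. If ALTERNATE-TEXT is declared with a tokenized type such as NMTOKENS, the run of spaces is also collapsed, and the attribute value is normalized to

This is a picture of a penguin dancing.

If the attribute is CDATA (or undeclared), the spaces that indent the second line are kept.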
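You can observe normalization directly by comparing the text you typed with the value a parser reports. This sketch assumes Python's standard xml.etree.ElementTree module and an attribute with no declaration, so the CDATA rules apply: whitespace characters become spaces, but runs of spaces are not collapsed.

```python
import xml.etree.ElementTree as ET

# The attribute value contains a line feed, a tab, and a character reference.
source = '<GRAPHIC ALTERNATE-TEXT="a penguin\n\tdancing &#169; 1999"/>'
value = ET.fromstring(source).get("ALTERNATE-TEXT")

# The line feed and the tab each become one space; the reference becomes a character.
print(repr(value))
```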

Whitespace Handling

You may remember that in Chapter 2 we contrasted XML to HTML in its treatment of whitespace. Whereas HTML disregarded whitespace, XML preserved whitespace in your document content. To be technically accurate, the specification requires that an XML processor (usually a parser) pass whitespace on to the application (the consumer program of the data). The application then can determine whether whitespace is significant. The specification provides a special attribute called xml:space that can be attached to any element in order to specify the proper treatment of whitespace to the application. Here is the form of the xml:space attribute:

    <!ATTLIST elemName
            xml:space (default | preserve) 'preserve'>

The elemName in the general form is any element you want to attach the attribute to. By convention, the attribute applies to that element and its child elements. The value 'preserve' specifies that the application should preserve all whitespace. The value 'default' indicates that the application's default processing for whitespace is acceptable (whatever that may be).

Another aspect of handling whitespace across heterogeneous platforms is the processing of end-of-line characters. The problem is that there are three widespread methods for handling end-of-line: Mac OS uses a carriage return (CR), UNIX uses a line feed (LF) and Windows uses a carriage return line feed (CR-LF) sequence. The XML specification requires that the XML processor convert any of the stated conventions to a single LF to signify end-of-line.
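The end-of-line rule can be checked with a few lines of code. The sketch below assumes Python's standard xml.etree.ElementTree module and a made-up NOTE element that mixes all three conventions.

```python
import xml.etree.ElementTree as ET

# One element whose content uses CR, CR-LF, and LF line endings in turn.
mixed_line_endings = "<NOTE>mac\rwindows\r\nunix\n</NOTE>"
text = ET.fromstring(mixed_line_endings).text

# The application receives a single LF for each line ending.
print(repr(text))
```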

Any and Mixed Element Content Models

As previously stated, element type declarations start with the literal string <!ELEMENT followed by an element name and then a content specification:

<!ELEMENT html (head, body) >

The content specification can be one of four types: EMPTY, ANY, mixed content, or element content. The element content model is the most common. The EMPTY content model is for empty elements.

The ANY content model allows an element to contain any character data or child elements. This is a completely unstructured content specification and therefore is rarely used.

A mixed content element may contain character data, optionally interspersed with child elements. Here is the grammar for the mixed content specification:

Mixed ::= '(' S?  '#PCDATA' (S?  '|' S?  Name) *  S?  ')*'
              |  '(' S?  '#PCDATA' S? ')'

This grammar states that you can either have the literal #PCDATA followed by zero or more child element names or just have the literal #PCDATA by itself. PCDATA stands for parsed character data.

Example 1:  <!ELEMENT  NAME  (#PCDATA) >
Example 2:  <!ELEMENT  paragraph  (#PCDATA|quote|reference)* >
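Mixed content is where tree-based APIs need the most care, because the character data is scattered around the child elements. The following sketch assumes Python's standard xml.etree.ElementTree module and a made-up instance of the paragraph element from Example 2. In that API, text before the first child is stored in text, and text after a child is stored in that child's tail.

```python
import xml.etree.ElementTree as ET

paragraph = ET.fromstring(
    "<paragraph>She wrote <quote>markup matters</quote>"
    " in <reference>ch. 4</reference>.</paragraph>"
)
quote = paragraph.find("quote")
reference = paragraph.find("reference")

print(repr(paragraph.text))   # character data before the first child
print(repr(quote.tail))       # character data between the two children
print(repr(reference.tail))   # character data after the last child
full_text = "".join(paragraph.itertext())
print(full_text)
```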

CDATA Sections

A CDATA section is used in a document when you do not want the content to be treated as markup. The most obvious example of using a CDATA section would be to pass XML markup into an application (instead of having it parsed as markup). A CDATA section starts with the string <![CDATA[ and ends with the string ]]>. For example:

<![CDATA[<TITLE> an XML example </TITLE>]]>

Another possible use of a CDATA section would be to pass source code to an application without having to use character references for reserved characters like the < or > symbol.

Note

A CDATA section is only allowed where #PCDATA is allowed in your XML document.
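To confirm that the contents of a CDATA section reach the application as plain character data, you might try something like the following. It is a sketch that assumes Python's standard xml.etree.ElementTree module and an EXAMPLE wrapper element invented for the test.

```python
import xml.etree.ElementTree as ET

example = ET.fromstring(
    "<EXAMPLE><![CDATA[<TITLE> an XML example </TITLE>]]></EXAMPLE>"
)

print(repr(example.text))  # the tags arrive as text
print(len(example))        # number of child elements
```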


Conditional Sections

Conditional sections can occur only in the external subset of the document type declaration and in external parameter entities referenced from the internal subset. A conditional section allows you to turn on and off a series of markup declarations. There are two keywords used with conditional sections: INCLUDE and IGNORE.

A conditional section may include one or more complete declarations, comments, processing instructions, or nested conditional sections. If the keyword used is INCLUDE, the section is processed. If the keyword is IGNORE, the section is not processed.

Here is an example of using conditional sections:

<![INCLUDE[
<!ELEMENT  article (title, section+, references*)>
]]>
<![IGNORE[
<!ELEMENT  article  (title, section+)>
]]>

This is very useful for turning on and off parts of a DTD during development. You can use a parameter entity for the keyword of a conditional section. The processor will replace the reference before determining whether it should include or ignore the section. For example, the declarations above could be rewritten like this:

<!ENTITY % editor "INCLUDE">
<!ENTITY % author "IGNORE">
<![%editor;[
<!ELEMENT  article (title, section+, references*)>
]]>
<![%author;[
<!ELEMENT  article  (title, section+)>
]]>

Processing Instructions

A processing instruction is used to pass additional information to one specific processing application without changing the way the document is processed by other applications. In general, processing instructions should be used infrequently. The format of a processing instruction is the literal <? followed by a name (the name of the target application), followed by any text and ending with the literal string ?>.

Note

The name of the application in a processing instruction may not be any variation of the letters XML.


Here is an example to change the font of the first word of a paragraph. You may have to do this if you are using someone else's DTD that does not have markup for something you want to do. For example:

<SECTION> The man stood on the beach.
<p> <?EZFormat Font="24Pt"?> Hey! <?EZFormat endFont ?> </p>
</SECTION>
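An application that understands the EZFormat target has to pick the instructions out of the parsed document. Here is one way that might look, sketched with Python's standard xml.dom.minidom module; EZFormat is the target from the example above, and the document is shortened for the test.

```python
from xml.dom import minidom

document = minidom.parseString(
    '<SECTION><p><?EZFormat Font="24Pt"?>Hey!<?EZFormat endFont ?></p></SECTION>'
)
paragraph = document.documentElement.firstChild

# Each processing instruction exposes its target name and its remaining text.
instructions = [
    (node.target, node.data)
    for node in paragraph.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
```

Applications that do not recognize the target simply skip these nodes, which is why the rest of the document is unaffected.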

Another reason for processing instructions could be for sending special commands to a CGI program processing the XML prior to passing it to a client.

XML uses a special processing instruction for attaching XSL stylesheets to a document instance.

<?xml-stylesheet
           href="http://www.mystuff.com/memo.xsl"
           type="text/xsl" ?>

Last, remember that the XML declaration uses the same syntax as a processing instruction, although the specification defines it as a separate construct.

Encoding and the Standalone Document Declarations

In the primer on XML, I discussed the XML declaration and stated that it contained some literal text <?xml, followed by version information, an optional encoding declaration, an optional standalone document declaration, and the literal text ?>. I will now examine the two optional parts of the XML declaration: the encoding declaration and the standalone document declaration.

The encoding declaration specifies the character set encoding for the following document. The earlier section "Character References" contains a note that defines character set encoding and how it relates to both characters and character sets. The specification requires all XML processors to support both UTF-8 and UTF-16 encoding. Support of all other encodings is optional. In the absence of an encoding declaration or a byte order mark (this allows auto-detection of a UTF-16 encoded file), the encoding must be UTF-8. Because ASCII is a subset of UTF-8, ordinary ASCII files do not need an encoding declaration. Here are some examples of encoding declarations:

<?xml version='1.0' encoding='UTF-16'?>
<?xml version='1.0' encoding='ISO-10646-UCS-2'?>

A list of Internet-supported character set names can be retrieved from

ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
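The effect of the encoding declaration is easiest to see by parsing raw bytes. The sketch below assumes Python's standard xml.etree.ElementTree module, whose underlying parser happens to support Latin1 (ISO-8859-1) in addition to the two required encodings; another processor may not.

```python
import xml.etree.ElementTree as ET

# Python's "utf-16" codec writes a byte order mark in front of the text.
utf16_bytes = '<?xml version="1.0" encoding="UTF-16"?><P>\u00a9</P>'.encode("utf-16")
# Latin1 is an optional encoding, so it has to be declared.
latin1_bytes = '<?xml version="1.0" encoding="ISO-8859-1"?><P>\u00a9</P>'.encode("latin-1")
# The same Latin1 bytes without a declaration are read as UTF-8 and rejected.
undeclared_bytes = "<P>\u00a9</P>".encode("latin-1")

utf16_text = ET.fromstring(utf16_bytes).text
latin1_text = ET.fromstring(latin1_bytes).text
try:
    ET.fromstring(undeclared_bytes)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(utf16_text, latin1_text, undeclared_ok)
```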

Note

Any external parsed entity may begin with a text declaration. A text declaration is like an XML declaration except that the version information is optional, the encoding declaration is required, and there is no standalone document declaration. For example:

<?xml encoding='UTF-16'?>


The standalone document declaration is rarely used and is not recommended for general use. As we stated previously, a DTD can be composed of both an external subset and an internal subset. An external subset is stored elsewhere and referenced via a URI. A standalone document declaration declares whether an application needs to fetch the external subset of the DTD to process the document correctly. For example:

<?xml version="1.0"  standalone="yes" ?>
<!DOCTYPE  HTML  SYSTEM  "http://www.xmlstuff.com/html.dtd">
<HTML> ... </HTML>

This example tells a processor that the client does not need to fetch the DTD to properly process the document. It is important to note that a document cannot be validated unless both the external subset and the internal subset of its DTD have been processed.
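A program that wants to honor the standalone document declaration first has to read it. One possible way, sketched here with Python's standard xml.parsers.expat module, is to install a handler for the XML declaration. The function name read_standalone is hypothetical; the handler's third argument is 1 for "yes", 0 for "no", and -1 when the clause is omitted.

```python
import xml.parsers.expat

def read_standalone(document: str) -> int:
    """Return 1 for standalone="yes", 0 for "no", and -1 if the clause is absent."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    # Called once, when the parser reads the XML declaration.
    parser.XmlDeclHandler = lambda version, encoding, standalone: seen.append(standalone)
    parser.Parse(document, True)
    return seen[0] if seen else -1

sample = (
    '<?xml version="1.0" standalone="yes"?>'
    '<!DOCTYPE HTML SYSTEM "http://www.xmlstuff.com/html.dtd">'
    "<HTML/>"
)
print(read_standalone(sample))
```

The parser in this sketch does not fetch the external subset, so the example runs without network access.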

Another scenario for using the standalone document declaration is if multiple programs process a document but only the first one validates the document. All ensuing programs could safely skip that step.

The XML Grammar

The XML grammar is specified using an Extended Backus-Naur form (EBNF) notation. Using an EBNF defines a context-free grammar—a grammar that is independent of the context in which it is used. The notation for definitions in the grammar is

symbol ::=  expression

where expression defines the rule for creating the symbol on the left-hand side. This formal grammar ensures that there is no ambiguity in the XML syntax. All the legal expressions are precisely defined via EBNF.

Note

It is important to keep in mind that this section refers to EBNF syntax and not XML syntax. The purpose for reviewing EBNF is to give you the ability to consult the XML specification when necessary.


EBNF statements are also called production rules, because they express the way in which valid symbols are constructed or produced using other symbols or specific fixed strings.

Table 4.4 shows EBNF notations in the grammar and their meaning.

Table 4.4. EBNF Notations
Notation                 Description
(expression)             A group expression treated as a single unit.
#xN                      Where N is a hexadecimal integer. This notation matches a specific UCS character.
"string"                 Matches a literal string.
'string'                 Matches a literal string.
A?                       Matches A or nothing; means A is optional.
A+                       Matches one or more occurrences of A.
A*                       Matches zero or more occurrences of A.
A B                      Matches A followed by B.
A | B                    Matches A or B, but not both.
A - B                    Matches any string that matches A but does not match B.
[a-zA-Z], [#xN-#xN]      Matches any character with a value in the ranges (inclusive).
[^a-z], [^#xN-#xN]       Matches any character with a value outside the range.
[^abc]                   Matches any character not among the given characters.

Here is a snippet of the XML grammar in the XML specification:

elementdecl  ::=  '<!ELEMENT' S  Name  S  contentspec  S?  '>'
S  ::=  (#x20  |  #x9  |  #xD  |  #xA)+
Name  ::=  (Letter  |  '_' |  ':')  (NameChar)*
contentspec ::=  'EMPTY' |  'ANY' |  Mixed  |  children
children  ::=  (choice  |  seq)  ('?' |  '*' |  '+') ?
cp  ::=  (Name  |  choice  | seq)  ('?' |  '*' |  '+') ?
choice  ::=  '(' S?  cp  (  S?  '|' S?  cp  )*  S?  ')'
seq  ::=  '(' S?  cp  (  S?  ',' S?  cp  )*  S?  ')'
Mixed  ::=  '(' S?  '#PCDATA' (S?  '|' S?  Name)*  S?  ')*'
                   |  '(' S?  '#PCDATA' S?  ')'

Note

See Letter and NameChar rules in the XML Recommendation. They occupy several pages and were left out for brevity.


Here is a partial translation of the production rules:

An element declaration is the literal <!ELEMENT, followed by whitespace, a legal XML name, more whitespace, a symbol called contentspec, (optionally) whitespace, and finally the literal >.

A contentspec is either the literal EMPTY or ANY, or the translation of the symbol Mixed or the symbol children.

A children symbol is translated as either a choice or a sequence followed optionally by an occurrence indicator, which is the literal ?, *, or +.

A choice symbol is translated as the literal (, optional whitespace, a symbol called cp (a content particle), zero or more literals | each followed by another content particle (with optional whitespace around them), optional whitespace, and a literal ).
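Translating a production rule into a pattern is a useful exercise for reading EBNF. The following sketch turns the Mixed production into a Python regular expression. The S and NAME patterns are ASCII-only stand-ins for the real symbols, and is_mixed is a hypothetical helper, so treat this as an approximation of the grammar rather than a validator.

```python
import re

# ASCII-only stand-ins for the S and Name symbols of the grammar.
S = r"[ \t\r\n]+"
NAME = r"[A-Za-z_:][A-Za-z0-9._:\-]*"

# Mixed ::= '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*'
#         | '(' S? '#PCDATA' S? ')'
MIXED_RE = re.compile(
    rf"\((?:{S})?#PCDATA(?:(?:{S})?\|(?:{S})?{NAME})*(?:{S})?\)\*"
    rf"|\((?:{S})?#PCDATA(?:{S})?\)"
)

def is_mixed(content_spec: str) -> bool:
    """True if `content_spec` matches the (approximated) Mixed production."""
    return MIXED_RE.fullmatch(content_spec) is not None

for spec in ["(#PCDATA)", "(#PCDATA|quote|reference)*", "(#PCDATA|quote)", "(head, body)"]:
    print(spec, is_mixed(spec))
```

Each EBNF operator maps onto a regular expression operator here: S? becomes an optional group, the parenthesized repetition keeps its *, and the two alternatives are joined with |.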
