Components of XML

As stated in the section on the history of XML, XML is a formalization of rules for "marking up" documents with meta-data to convey extra information (the data's purpose) to the user. So, an XML-compliant document can be separated into markup and content. All markup starts with either an ampersand (&) or left angle bracket (<). There are six types of markup defined in XML. They are as follows: elements, attributes, comments, processing instructions, entity references, and CDATA sections.

  • Elements— Elements are the most common aspect of markup languages. An element is a logical construct of a document. A normal element is composed of start and end tags that surround content, other elements, or both. The tags of an element are delimited by angle brackets. For example:

    <STREET> 4296 Razor Hill Road </STREET>
    
  • Attributes— An element may have attributes that are specified in name/value pairs and are placed after the start-tag name. In this example, the width and height are the attributes:

    <APPLET width="100" height="200">
    
  • Comments— A comment allows free text description that is ignored by an XML processor. For example:

    <!-- Keep this part it is really important. -->
    
  • Processing Instructions— Processing instructions are used to pass information to a processing application. Here is the syntax of a processing instruction:

    <?application data ?>
    
  • Entity References— Entity references are used to put reserved characters or abbreviations in markup. For example, the left angle bracket (<) is a reserved character. To put that character in your markup, use the reference &lt;. The lt stands for less than and the ampersand (&) and semicolon (;) delimit the reference in the content. A list of all the predefined character references is provided in Chapter 4.

  • CDATA Sections— A CDATA section is a section of text that should not be processed but instead passed directly to the application. This is useful for passing source code to an application. CDATA sections will be covered in detail in Chapter 4.
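
To make the six kinds of markup concrete, here is a minimal sketch in Python that parses a small document containing all of them and reports what the parser found. It uses the standard library's xml.dom.minidom module. The ADDRESS, NOTE, and CODE elements and the kind attribute are made up for illustration and are not part of any real markup language.

```python
from xml.dom import minidom, Node

# Hypothetical sample document that uses all six kinds of markup.
DOC = """<?xml version="1.0"?>
<ADDRESS kind="home">
  <!-- Keep this part. It is really important. -->
  <?application data ?>
  <STREET>4296 Razor Hill Road</STREET>
  <NOTE>a &lt; b</NOTE>
  <CODE><![CDATA[if (a < b) { run(); }]]></CODE>
</ADDRESS>"""


def classify(doc_text):
    """Sort the pieces of markup under the root element by kind."""
    root = minidom.parseString(doc_text).documentElement
    found = {"attributes": dict(root.attributes.items()), "elements": {}}
    for node in root.childNodes:
        if node.nodeType == Node.COMMENT_NODE:
            found["comment"] = node.data.strip()
        elif node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            found["pi"] = (node.target, node.data.strip())
        elif node.nodeType == Node.ELEMENT_NODE:
            child = node.firstChild
            kind = "cdata" if child.nodeType == Node.CDATA_SECTION_NODE else "text"
            found["elements"][node.tagName] = (kind, child.data)
    return found


if __name__ == "__main__":
    for key, value in classify(DOC).items():
        print(key, "->", value)
```

Notice that the entity reference &lt; reaches the application as a plain left angle bracket, and that the CDATA section is delivered without any interpretation of the angle bracket inside it.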

In addition to marking up your document, you can specify the rules that govern your new markup language. These rules will determine whether a document of your type is valid (follows your rules). You specify the rules for your language in a document type definition (DTD). In your document (called an instance of the DTD), you then indicate which DTD specifies the rules for your language in a document type declaration. Inside a DTD, you can declare elements, element attributes, and entity references. Figure 1.1 shows the one-to-many relationship between DTDs and instance documents, as well as the distinction between the document type declaration and the definition.

Figure 1.1. Relationship between DTD and instances.


BNF Grammar

When describing each specific component of XML, at times we will refer to the exact words in the specification. To make sense of the specification, you must understand how it strictly defines the grammar of this language. To express XML in a manner in which computer-generated parsers can be built, the designers defined the language using Backus-Naur form (BNF). BNF describes a context-free grammar, which is a system of definitions in the following form:

    part_of_speech ::= definition

The ::= means "is defined as". A definition that is composed of a left-hand side (LHS) and a right-hand side (RHS) is called a production. For example, a production for an English sentence would be:

    sentence ::= subject predicate PERIOD

A sentence is a subject followed by a predicate followed by a PERIOD. There are two types of symbols on the right-hand side: nonterminal symbols and terminal symbols. A nonterminal symbol must be defined further in the grammar (like the subject in a sentence). A terminal symbol is a literal word or symbol (like a "." for PERIOD) that needs no further expansion.

There are other ways to describe the right-hand side of a production. You can have the LHS equate to one among a choice of symbols. To do this, use the pipe symbol (|), which is commonly used as the OR operator in programming languages such as C and Java. Consider the following production:

    Space ::= (#x20 | #x9 | #xD | #xA)+

This production would read as follows:

Space is defined as one or more spaces (character 20 in hex), tabs (9 in hex), carriage returns (D in hex, which is 13 in decimal), or line feeds (A in hex, which is 10 in decimal). Note that in hexadecimal (a base-16 numbering system), the values 10 through 15 are represented with the letters A through F. In the production, notice the plus sign outside the parentheses. The plus sign is an occurrence indicator that stands for "one or more". You will see this occurrence indicator again, as it is the same one used in the element declarations of a document type definition. Other occurrence indicators are an asterisk (*), which means "zero or more", and a question mark (?), which means "zero or one".
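
If you are used to regular expressions, the Space production translates almost symbol for symbol. The following is a small Python sketch of that translation. The names SPACE and is_space are our own and do not come from the specification.

```python
import re

# (#x20 | #x9 | #xD | #xA)+
# A choice of four characters (space, tab, carriage return, line feed),
# repeated one or more times by the + occurrence indicator.
SPACE = re.compile(r"[\x20\x09\x0D\x0A]+")


def is_space(text):
    """Return True if the whole string matches the Space production."""
    return SPACE.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_space(" \t\r\n"))  # only the four whitespace characters
    print(is_space(""))         # the + indicator requires at least one
    print(is_space(" a "))      # contains a character outside the choice
```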

Now that you have a general idea how a BNF Grammar works, we will examine how the document prolog is precisely defined in the XML Specification using BNF. We will examine BNF in more detail in Chapter 4. It is introduced here to highlight the fact that there is no ambiguity in the language specification. Every component of XML syntax is fully specified in BNF.

Prolog

A prolog is an introduction to a document. Here is a sample prolog of an XML document:

<?xml version="1.0"?>
<!DOCTYPE JDATA SYSTEM "javadata.dtd">

The prolog consists of two parts, both of which the specification makes optional. The first part is called the XML declaration and the second part the document type declaration. Here is the formal definition (in BNF) of the prolog as listed in the XML specification:

[22] prolog      ::= XMLDecl? Misc* (doctypedecl Misc*)?
[23] XMLDecl     ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
[24] VersionInfo ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
[25] Eq          ::= S? '=' S?
[26] VersionNum  ::= ([a-zA-Z0-9_.:] | '-')+
[27] Misc        ::= Comment | PI | S

Let's translate each production rule into a corresponding English sentence. We will cover all the symbols of the BNF syntax in more detail in Chapter 4; this section will just be an introduction.

Production 22 states that "a prolog is defined as an optional XML declaration, followed by zero or more miscellaneous items (defined in production 27 as a comment, processing instruction, or space), followed by an optional document type declaration that may itself be followed by zero or more miscellaneous items."

Production 23 states that "an XML declaration is defined as the literal string '<?xml' followed by Version information followed by an optional Encoding declaration followed by an optional standalone document declaration followed by optional space then followed by the literal string '?>'."

Production 24 states that version information is defined as "space followed by the literal string 'version' followed by an equal sign followed by a version number, which is surrounded by either single quotes or double quotes."

Production 25 states that the equal sign may be preceded or trailed by optional space (remember the production for space or whitespace was given earlier).

Production 26 states that a version number is defined as one or more of the following characters: lowercase a through z, uppercase A through Z, the digits 0 through 9, underscore, period, colon, or hyphen. Notice that the hyphen must be declared separately as a quoted literal because inside the square brackets it denotes a range of characters.

So, as stated more formally earlier, a prolog consists of an optional XML declaration and an optional document type declaration (discussed later). Almost all XML-compliant documents will use an XML declaration. An XML declaration consists of the literal string '<?xml', followed by version information, followed by an optional encoding declaration, followed by an optional standalone document declaration, and ending with the literal text '?>'. Here is the minimum requirement for an XML declaration:

<?xml  version="1.0" ?>

If used, the XML declaration must be the first item in every XML-compliant document you create. It specifies to a processing application which version of the specification your document adheres to. The optional encoding declaration specifies what character encoding your document uses. The default is Unicode (the same character set Java uses) encoded in UTF-8 format. The optional standalone document declaration is rarely used but specifies whether a processing application can skip downloading the document type definition to process the document. It takes the literal value "yes" or "no". Here is another example of an XML declaration:

<?xml  version="1.0" encoding="UTF-8" standalone="no" ?>
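
Because each production is so precise, productions 23 through 26 can be turned into a pattern almost mechanically. The Python sketch below does this with regular expressions. Treat it as an illustration rather than a conforming parser: the patterns for the encoding and standalone declarations follow our reading of the XML 1.0 specification, since those productions are not reproduced in this chapter, and the name parse_xml_decl is our own.

```python
import re

S = r"[\x20\x09\x0D\x0A]+"              # the whitespace production
EQ = rf"(?:{S})?=(?:{S})?"              # [25] Eq
QUOTE = r"""['"]"""                     # a single or a double quote
VERSION_NUM = r"[a-zA-Z0-9_.:\-]+"      # [26] VersionNum

# [24] VersionInfo: the closing quote must be the same kind as the opening one.
VERSION_INFO = rf"{S}version{EQ}(?P<q1>{QUOTE})(?P<version>{VERSION_NUM})(?P=q1)"
# Assumed from the XML 1.0 specification (not listed in this chapter).
ENCODING_DECL = (rf"{S}encoding{EQ}(?P<q2>{QUOTE})"
                 rf"(?P<encoding>[A-Za-z][A-Za-z0-9._\-]*)(?P=q2)")
SD_DECL = rf"{S}standalone{EQ}(?P<q3>{QUOTE})(?P<standalone>yes|no)(?P=q3)"

# [23] XMLDecl
XML_DECL = re.compile(
    rf"<\?xml{VERSION_INFO}(?:{ENCODING_DECL})?(?:{SD_DECL})?(?:{S})?\?>"
)


def parse_xml_decl(text):
    """Return the parts of an XML declaration, or None if it does not match."""
    match = XML_DECL.fullmatch(text)
    if match is None:
        return None
    return {key: match.group(key) for key in ("version", "encoding", "standalone")}


if __name__ == "__main__":
    print(parse_xml_decl('<?xml version="1.0"?>'))
    print(parse_xml_decl('<?xml  version="1.0" encoding="UTF-8" standalone="no" ?>'))
```

Optional parts that are absent from the declaration come back as None, which mirrors the question-mark occurrence indicators in production 23.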

Elements

An element is a logical component of a document. An element normally contains content, either character data or other elements, and is denoted by a start and end tag. For example:

     <TITLE> This is an HTML title. </TITLE>

or

    <HTML>
        <HEAD> ... </HEAD>
        <BODY> ... </BODY>
    </HTML>

In XML, you can only have a single root element. That root element has subelements which may also have subelements. The structure of an XML document is a tree of elements. So, as stated earlier, if you think of an element as a container, an XML document becomes a container of containers. Containers have a name associated with them (the element name) and possible additional characteristics (called attributes). The containers hold the content (or data) of the document. The start and end tags define the boundaries of the container.

A start tag consists of a left angle bracket (<), an identifier or tag name, and a right angle bracket (>). Elements may also have attributes (discussed next). An end tag is identical to a start tag except that the identifier is preceded with a forward slash (/). In the XML specification, the precise definition of tag is < followed by an element-type name, followed by a >.

Some elements contain no content. These are called empty elements. You denote this by preceding the greater-than angle bracket with a forward slash—for example, <EmptyTag/>. Usually empty tags have attributes—for example:

<IMG SRC="house.jpg" />

Note

In XML, an element with content must have a start tag and an end tag.
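
The "container of containers" idea is easy to see in code. The following Python sketch, which uses the standard library's xml.etree.ElementTree module, walks a small document and lists each element indented under its parent. The sample document and the function name outline are ours.

```python
import xml.etree.ElementTree as ET

DOC = """<HTML>
    <HEAD><TITLE>This is an HTML title.</TITLE></HEAD>
    <BODY><IMG SRC="house.jpg" /></BODY>
</HTML>"""


def outline(element, depth=0):
    """Return one indented line per element, each child after its parent."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    root = ET.fromstring(DOC)   # the single root element
    print("\n".join(outline(root)))
    img = root.find("BODY/IMG")
    # An empty element has no text and no subelements, only attributes.
    print(img.text, len(img), img.attrib)
```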


Attributes

In addition to content, elements may have attributes. Attributes allow you to attach characteristics to an element. Attributes have a name and a value and are placed within the start tag—for example:

<APPLET code="myapplet.class">

In the document type definition (DTD), you define the legal attributes for an element and what values are legal for that attribute. We discuss creating a DTD in a following section.

An element can have multiple attributes—for example:

<APPLET code="myapplet.class" height="100" width="100">

In XML, the value must be surrounded by single or double quotes. When you use one type of quotes, the other type is legal within the quotes—for example:

<PERSON quote=" 'To Be or Not to Be'">
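
Here is a short Python sketch, again using xml.etree.ElementTree, that shows how a parser hands attributes to an application as name/value pairs. End tags have been added to the examples above so that each one is a complete document, and the helper name read_attributes is our own.

```python
import xml.etree.ElementTree as ET


def read_attributes(text):
    """Parse a one-element document and return its attributes as a dict."""
    return dict(ET.fromstring(text).attrib)


if __name__ == "__main__":
    print(read_attributes(
        '<APPLET code="myapplet.class" height="100" width="100"></APPLET>'))

    # Single quotes are legal inside a value delimited by double quotes.
    print(read_attributes("""<PERSON quote=" 'To Be or Not to Be'"></PERSON>"""))

    # An unquoted value is rejected by the parser.
    try:
        read_attributes("<APPLET height=100></APPLET>")
    except ET.ParseError as err:
        print("not well-formed:", err)
```

Attribute values always arrive as strings. If your application needs a number, it must convert the value itself.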

Comments

A comment is extra information for humans who read the document, and is not meant for a program to use. When parsed by a computer program, comments are not passed on to the application. The format of a comment is as follows:

<!-- Note: remember to write more stuff here. -->

You are not allowed to put the characters "--" inside your comment content. Comments cannot go inside of tags or within other comments. Comments are useful to describe elements, especially cryptic ones such as <P>.
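
The rules for comments can be checked with a few lines of Python. This sketch relies on the default behavior of xml.etree.ElementTree, which discards comments while building its tree. The MEMO element and the helper name parses are made up for the example.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    memo = ET.fromstring(
        "<MEMO><!-- Note: remember to write more stuff here. -->Lunch at noon</MEMO>")
    # The comment is not passed on; only the content remains.
    print(repr(memo.text), len(memo))

    print(parses("<MEMO><!-- no -- double hyphens --></MEMO>"))   # "--" inside a comment
    print(parses("<MEMO <!-- not inside a tag -->></MEMO>"))      # comment inside a tag
```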

Document Type Definition

We have learned about the elements and attributes that represent the key indicators of the structure or purpose of our content but we have not yet covered how to determine which tags we can use in a document. A document type definition (DTD) declares all the legal elements in a document; the legal attributes those elements can have; and the hierarchy, nesting, and occurrence indicators for all elements. In order for a document to be valid it must specify what DTD it adheres to. A document specifies its DTD in a document type declaration.

If you remember back to the Prolog section, a document type declaration is optional but, if present, occurs after the XML declaration and before the root element of the document. There are two forms of document type declarations: internal and external. An internal document type declaration includes the element and attribute declarations in the same document as the markup and content. This is practical only for short DTDs. Listing 1.3 is an example of an internal DTD.

Code Listing 1.3. Internal DTD
<?xml version="1.0"?>
<!DOCTYPE JDATA
[
<!ELEMENT JDATA (OBJECT)+>
<!ELEMENT OBJECT (PRIMITIVE|OBJECT|ARRAY)+>
<!ELEMENT PRIMITIVE (#PCDATA)>
<!ELEMENT ARRAY (PRIMITIVE+|OBJECT+|ARRAY)>
<!ATTLIST PRIMITIVE
    name    CDATA    #IMPLIED
    type    (string|byte|char|short|int|long|float|double|boolean) #REQUIRED>
<!ATTLIST OBJECT
    name    CDATA    #IMPLIED
    type    CDATA    #IMPLIED>
<!ATTLIST ARRAY
    name    CDATA    #IMPLIED
    type    (array|object|string|byte|char|short|int|long|float|double|boolean) #REQUIRED>
]>

<JDATA>
<OBJECT name="test">
<PRIMITIVE name="anInt"
       type="int">200 </PRIMITIVE>
</OBJECT>
</JDATA>

All document type declarations start with the string literal <!DOCTYPE. The next word is the document name, which must correspond to the root element of the document. For an internal DTD, you include all the element declarations and attribute declarations between the open bracket ([) and the close bracket (]).
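
A parser reports the declarations in an internal DTD to an application as it reads them. The following Python sketch uses the standard library's xml.parsers.expat module to collect the document name, the declared elements, and the declared attributes, and to record the root element for comparison. Note that expat is a non-validating parser, so this only lists the declarations and does not enforce them. The DTD is a shortened version of Listing 1.3, and the function name summarize_dtd is our own.

```python
import xml.parsers.expat

# A shortened version of the internal DTD, trimmed for this example.
DOC = """<?xml version="1.0"?>
<!DOCTYPE JDATA [
<!ELEMENT JDATA (OBJECT)+>
<!ELEMENT OBJECT (PRIMITIVE|OBJECT)+>
<!ELEMENT PRIMITIVE (#PCDATA)>
<!ATTLIST PRIMITIVE
    name    CDATA    #IMPLIED
    type    (int|string) #REQUIRED>
]>
<JDATA>
<OBJECT>
<PRIMITIVE name="anInt" type="int">200</PRIMITIVE>
</OBJECT>
</JDATA>"""


def summarize_dtd(doc_text):
    """Collect the document name, root element, and DTD declarations."""
    info = {"doctype": None, "root": None, "elements": [], "attributes": []}

    def start_doctype(name, system_id, public_id, has_internal_subset):
        info["doctype"] = name

    def element_decl(name, model):
        info["elements"].append(name)

    def attlist_decl(element, attribute, type_, default, required):
        info["attributes"].append((element, attribute, bool(required)))

    def start_element(name, attrs):
        if info["root"] is None:
            info["root"] = name

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = start_doctype
    parser.ElementDeclHandler = element_decl
    parser.AttlistDeclHandler = attlist_decl
    parser.StartElementHandler = start_element
    parser.Parse(doc_text, True)
    return info


if __name__ == "__main__":
    summary = summarize_dtd(DOC)
    print(summary)
    # The document name must correspond to the root element.
    print(summary["doctype"] == summary["root"])
```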

An external DTD is stored externally and referenced via a Uniform Resource Identifier (URI) which is similar to the more familiar Uniform Resource Locator (URL). A URL is one form of a URI and is the most common method for accessing external DTDs. Listing 1.4 is an example of an external DTD declaration.

Code Listing 1.4. External DTD Declaration
<?xml version="1.0"?>
<!DOCTYPE JDATA SYSTEM "javadata.dtd">

<JDATA>
<OBJECT name="test">
<PRIMITIVE name="anInt"
       type="int">200 </PRIMITIVE>
</OBJECT>
</JDATA>

For external DTDs, the third part of the document type declaration must be either SYSTEM or PUBLIC. If it is SYSTEM, the final part must be a Uniform Resource Identifier (URI). A URL is a valid URI. The format of a URL is as follows:

protocol://host[:port]/path/resource

The most common protocol in a URL is HTTP, the Hypertext Transfer Protocol. That is the protocol that Web browsers use to communicate with Web servers. A URI allows URLs to be extended with an optional query and an optional fragment identifier. We will discuss URIs in more detail in a future section on XPointers. In the example above, the URL is a relative URL, which means that the DTD is in the same directory as the current document. If you declare a DTD as PUBLIC it is followed by a unique name so that the software knows where to locate the DTD. In essence, the software vendors would provide these built-in DTDs.
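
To see what "relative" means in practice, the Python sketch below reads the system identifier from a document type declaration and resolves it against the address of the document itself. It uses xml.dom.minidom and urllib.parse from the standard library. The host www.example.com and the path /data/test.xml are placeholders, not a real location, and the function name dtd_location is ours.

```python
from urllib.parse import urljoin
from xml.dom import minidom

DOC = """<?xml version="1.0"?>
<!DOCTYPE JDATA SYSTEM "javadata.dtd">
<JDATA/>"""


def dtd_location(doc_text, document_url):
    """Resolve the SYSTEM identifier against the URL the document came from."""
    doctype = minidom.parseString(doc_text).doctype
    return urljoin(document_url, doctype.systemId)


if __name__ == "__main__":
    # Hypothetical address of the instance document.
    print(dtd_location(DOC, "http://www.example.com/data/test.xml"))
```

The parser here only reads the declaration. It does not download the DTD.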

The key benefit of an external DTD is that numerous documents can share a single DTD. By sharing the DTD, any changes to the DTD automatically affect all instance documents that refer to that DTD. Thus you have central management of the specification of your markup language. Secondly, space and time can be saved by not replicating a DTD among all instances of the document.

Element Declaration

Every element in a valid XML Document must correspond to an element type declared in a DTD. Element type declarations start with the string literal <!ELEMENT, followed by the element name, and then by a content specification. Here is the general form for an element declaration:

    <!ELEMENT  elemName  content-spec>

Element type names are XML names. An XML name must begin with a letter or an underscore, followed by any number of letters, digits, hyphens, underscores, or periods.

In general, the content specification states whether this element will contain child elements or character data. If the element will contain child elements, you further specify ordering, repeatability and presence requirements. The content specification is one of four types. Table 1.3 lists the element content types.

Table 1.3. Element Content Types
Content Specification Type | Content
EMPTY content | No content
ANY content | May have any content (character data, subelements, or both)
Mixed content | May have character data or a mix of character data and subelements specified in the mixed content specification
Element content | Only has subelements as specified in the element content specification

Now let's describe each content model specification:

  • EMPTY content— EMPTY content is the model for empty elements.

  • ANY content— ANY content is rarely used. Normally your elements have a structure. For example, an HTML document has subelements head and body. A head element can have subelements, and so on.

  • Mixed content— A mixed content allows both character data and subelements. The definition of the subelements is exactly as described in the element content specification below.

  • Element content— The element content specification is the most common. You specify an element (or child) content specification by creating a content model. A content model is a pattern that specifies what subelements are allowed, and what order they should occur in.

Let's examine the element content specification in more detail by working through a series of examples.

<!ELEMENT shipping EMPTY>

This declares an element called shipping with an EMPTY content specification. Notice that no parentheses are used here, unlike in the mixed and element content specifications. The ANY content model would be specified by just replacing the word EMPTY in the above example with ANY.

<!ELEMENT memo (name)>

This example states that the memo element must have exactly one subelement called name. The parentheses are used and form what is called a content particle. Content particles may be nested inside other content particles. This is demonstrated below.

<!ELEMENT memo (name, date)>

This example states that the memo element will have one name element, followed by one date element. The comma specifies a sequence, meaning that date must follow name. This is called a sequence content particle.

<!ELEMENT FIGURE (GRAPHIC | TABLE | SCREEN-SHOT)>

This example states that a figure will have a GRAPHIC or a TABLE or a SCREEN-SHOT subelement. This is called a choice content particle.

To make elements repeatable or optional you use an occurrence indicator. Table 1.4 lists the three occurrence indicators.

Table 1.4. The Content Particle Occurrence Indicators
Indicator | Element or Content Particle Can Occur…
? | zero or one time (optional)
* | zero or more times (optional and repeatable)
+ | one or more times (required and repeatable)

Now let's examine some examples using occurrence indicators.

<!ELEMENT JDATA (OBJECT)+>

This example states that a JDATA element must have at least one and maybe more OBJECT elements.

As stated above, if you surround an element or sequence or choice of elements in parentheses you create a content particle. This is important because you can put an occurrence indicator on both an element and a content particle.

<!ELEMENT OBJECT (PRIMITIVE|OBJECT|ARRAY)+>

This example states that an OBJECT can contain one or more of the following choices: a PRIMITIVE, an OBJECT, or an ARRAY. The choice forms a content particle which is then repeated one or more times using the '+' occurrence indicator.

<!ELEMENT PRIMITIVE (#PCDATA)>

This example states that a PRIMITIVE element can contain just character data and no subelements. The literal string #PCDATA stands for parsed character data. Parsed character data is the technical term for the text content of your document.

<!ELEMENT ARRAY (PRIMITIVE+|OBJECT+|ARRAY)>

This example states that an ARRAY element can contain one of the following: one or more PRIMITIVE elements, one or more OBJECT elements, or a single ARRAY element.

<!ELEMENT FOO ( (PRIMITIVE, ARRAY) | (ARRAY, PRIMITIVE) )>

This example demonstrates nested content particles. It declares an element named FOO that may only have two child elements (PRIMITIVE and ARRAY) but in any order.
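
If content models remind you of regular expressions, that is no accident: the parentheses, the pipe, and the three occurrence indicators mean the same thing in both notations, and the comma is simple concatenation. The Python sketch below exploits this to test a list of child element names against an element content model. It is an illustration only, not how a validating parser is actually implemented; it handles element content but not #PCDATA, EMPTY, or ANY, and the function names are our own.

```python
import re

NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def model_to_regex(model):
    """Turn an element content model such as "(name, date)" into a regex."""
    # Each element name becomes a group that consumes the name plus a ";"
    # terminator, so that one name can never run into the next.
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group()) + ";)", model)
    # A sequence is plain concatenation in a regex, so commas and whitespace
    # are dropped. Parentheses, |, ?, * and + are left exactly as written.
    return re.compile(re.sub(r"[\s,]", "", pattern))


def children_match(model, child_names):
    """Return True if the child names, in order, satisfy the content model."""
    children = "".join(name + ";" for name in child_names)
    return model_to_regex(model).fullmatch(children) is not None


if __name__ == "__main__":
    print(children_match("(name, date)", ["name", "date"]))
    print(children_match("(name, date)", ["date", "name"]))
    print(children_match("(PRIMITIVE+|OBJECT+|ARRAY)", ["PRIMITIVE", "OBJECT"]))
    print(children_match("( (PRIMITIVE, ARRAY) | (ARRAY, PRIMITIVE) )",
                         ["ARRAY", "PRIMITIVE"]))
```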

Attribute-List Declaration

As stated previously, attributes of an element take the form of name/value pairs. For example:

<A HREF="http://www.gosynergy.com">

In a DTD, an attribute list declaration begins with the string literal <!ATTLIST and then is followed by the element name these attributes are for. After the name, you add one or more attribute declarations. An attribute declaration consists of three parts: the attribute name, its type, and a default declaration. The general form of an attribute declaration is

    <!ATTLIST  elemName
                 attName  attType  default-decl>

Here is an example attribute list declaration:

<!ATTLIST A
                    HREF CDATA #REQUIRED>

This example states that the A element has one attribute called HREF. This attribute accepts any character data (that is, there is no predefined list of legal values) and it is a required attribute. There are three attribute types: a string type (CDATA), a set of tokenized types, or an enumerated type. The two most common are the string and enumerated types. Tokenized types are discussed in Chapter 4.

You can declare many attributes in a single attribute-list declaration. For example,

<!ATTLIST EMPLOYEE
          STATUS   (SALARIED|HOURLY)   #REQUIRED 
          WORKWEEK  CDATA  #IMPLIED>

This example states that the EMPLOYEE element has two attributes, STATUS and WORKWEEK. The STATUS attribute's legal value is one of two choices in an enumerated list. The STATUS attribute is required, whereas the WORKWEEK attribute is optional.

There are four attribute default declaration options: #REQUIRED, #IMPLIED, #FIXED or a default value. A REQUIRED declaration means the attribute must be present with the element. An IMPLIED declaration allows the document writer to optionally include the attribute. With a FIXED declaration you must supply the value at which the attribute is fixed. Last, you can have a default value for an attribute that will be used if the user does not override it. Here is an example that uses a default value:

<!ATTLIST REMINDER
                type  (birthday|anniversary|other) "other">
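
You can watch a default value being applied with a few lines of Python. In this sketch, the standard library's xml.parsers.expat module reads an internal DTD that contains the REMINDER declaration and reports the attributes of each element as the application would see them. The enclosing REMINDERS element and the reminder text are invented for the example, and the function name attributes_seen is our own.

```python
import xml.parsers.expat

DOC = """<?xml version="1.0"?>
<!DOCTYPE REMINDERS [
<!ELEMENT REMINDERS (REMINDER)*>
<!ELEMENT REMINDER (#PCDATA)>
<!ATTLIST REMINDER
                type  (birthday|anniversary|other) "other">
]>
<REMINDERS>
<REMINDER type="birthday">Mom</REMINDER>
<REMINDER>Dentist</REMINDER>
</REMINDERS>"""


def attributes_seen(doc_text):
    """Return (element name, attributes) pairs in document order."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: seen.append((name, attrs))
    parser.Parse(doc_text, True)
    return seen


if __name__ == "__main__":
    for name, attrs in attributes_seen(DOC):
        print(name, attrs)
```

The second REMINDER does not mention the type attribute, yet the application still receives a value for it, because the parser fills in the default from the declaration.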

Syntactic Rules

Since XML is a document markup specification, it has syntax rules. Syntax dictates the rules for words to form larger grammatical constructs such as sentences. The following are syntactic rules for XML-compliant documents:

  • XML documents are composed of Unicode characters. Unicode is a character set designed to cover all the world's written languages. It began as a 16-bit character set and has since grown beyond 16 bits.

  • XML is case-sensitive. <HTML> and <html> are two different tags.

  • Whitespace consists of invisible characters such as space (ASCII 32), tab (ASCII 9), carriage return (ASCII 13), and line feed (ASCII 10). Whitespace is ignored inside of tags; however, whitespace is significant in content. By content, I mean the text between a start tag and an end tag. By significant, I mean that an XML Processor must pass it on to the using application.

  • You often name things in XML like elements and attributes. An XML name must begin with a letter or an underscore, followed by any number of letters, digits, hyphens, underscores, or periods. For example, the first name below is legal and the second is not, because it begins with a digit:

    MyUnique123Tag-Identifier
    2_an_illegal_tag
    
  • An XML name may not begin with xml (uppercase or lowercase). That is reserved for the specification creators.

  • Literal strings are enclosed in single or double quotes.

  • Other syntactic rules (full nesting, empty tags, start and end tag pairs) were listed when explaining differences between HTML and XML. To recap, XML tags must be properly nested. If an element exists inside another element, its end tag must be before the enclosing element's end tag. An empty tag is a start tag that ends with a forward slash before the right angle bracket. Lastly, a start tag must have an end tag.

  • An XML document may have only a single root element.
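
The naming rules in the list above are simple enough to express in a few lines of Python. This sketch checks only the rules as stated in this chapter and only for ASCII letters; the full specification also allows letters from other alphabets. The function name is_legal_name is our own.

```python
import re

# A letter or underscore, then any number of letters, digits,
# hyphens, underscores, or periods (ASCII only in this sketch).
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_legal_name(name):
    """Apply this chapter's naming rules to a candidate XML name."""
    if name.lower().startswith("xml"):
        return False    # reserved, in any mix of uppercase and lowercase
    return NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("MyUnique123Tag-Identifier", "2_an_illegal_tag", "XmlThing"):
        print(candidate, is_legal_name(candidate))
```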

Valid and Well-Formed Documents

There are two tests by which an XML-compliant document's correctness can be assessed: valid and well-formed. An XML document is valid if it declares a document type definition and conforms to the element and attribute declarations in the DTD. An XML document is well-formed if it follows all the syntax rules specified in the XML specification. A document does not require a document type definition (DTD) to be well-formed.
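
Checking that a document is well-formed requires nothing more than running it through a parser. The Python sketch below does so with xml.etree.ElementTree and reports where the first problem was found. Keep in mind that this module is a non-validating parser, so it tests only the well-formed half of the picture and says nothing about validity. The function name well_formedness_error is our own.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(text):
    """Return None if the text is well-formed, else (line, column) of the error."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return err.position
    return None


if __name__ == "__main__":
    samples = {
        "properly nested": "<A><B></B></A>",
        "improperly nested": "<A><B></A></B>",
        "two root elements": "<A></A><B></B>",
        "case mismatch": "<HTML></html>",
    }
    for label, text in samples.items():
        print(label, "->", well_formedness_error(text))
```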

Several validating XML parsers are available free. A validating parser will ensure that a document is both well-formed and valid and report any violations to the user. Sun Microsystems, IBM, and Microsoft all provide free validating parsers. In this chapter, we will validate documents using the IBM Java XML parser called xml4j. This parser can be downloaded free from http://www.alphaworks.ibm.com. The download will be in the form of a zip file or a gzipped tar file. Extract the zip file or tar file to a directory. It will create a directory called xml4j with several subdirectories. In the xml4j_x_x_x directory (the x's will be replaced with version numbers) you will see three jar files: xml4j.jar, xerces.jar and xercesSamples.jar. Add these three jar files to your classpath. For example, in Windows you would type

> set classpath=%classpath%;c:\xml4j_x_x_x\xml4j.jar;c:\xml4j_x_x_x\xercesSamples.jar;c:\xml4j_x_x_x\xerces.jar;

After installing the xml4j code and a Java virtual machine (1.1 compliant or greater) you can run the class sax.SAXWriter to ensure that your documents are well-formed and valid. You will do this in the next section.

Creating a Markup Language from Scratch

Let's now do a step-by-step walkthrough of creating an XML-based language and a document of that type.

Address books are a very common data item that many applications store in their own proprietary format. Thus they are a good candidate for a standard language so that any application could read and write the data.

The best way to begin creating your language is to first examine several examples of your data so that you can extract the appropriate meta-data that describes the purpose of your data.

Example 1. A home address:

    Michael Daconta
    4296 Razor Hill Road
    Bealeton, VA 22712

Meta-data:

    Name
    Street
    City, State Zip

Example 2. A business address:

    Sterling Software
    7900 Sudley Road
    Suite 500
    Manassas, VA 20109

Meta-data:

    Name
    Street
    Street2
    City, State Zip

In examining your meta-data it is easy to see the elements that you need for your simple language: NAME, STREET, CITY, STATE, ZIP. The only decision you need to make is how to handle multiple street designations. There are two ways to do this:

  • Allow multiple street tags.

  • Allow multiple street tags and distinguish the street designations with an attribute.

Let's use the simpler approach, which is the first option. Now you are ready to write a well-formed XML document for your data. Listing 1.5 is an example of an Address Book Markup Language (ABML) document.

Code Listing 1.5. An ABML Document with External DTD Declaration
<?xml version="1.0"?>

<!DOCTYPE ADDRESS_BOOK SYSTEM "abml.dtd">
<ADDRESS_BOOK>
    <ADDRESS>
        <NAME>Michael Daconta </NAME>
        <STREET>4296 Razor Hill Road </STREET>
        <CITY>Bealeton </CITY>
        <STATE>VA </STATE>
        <ZIP>22712 </ZIP>
    </ADDRESS>
    <ADDRESS>
        <NAME>Sterling Software </NAME>
        <STREET> 7900 Sudley Road</STREET>
        <STREET> Suite 500</STREET>
        <CITY>Manassas</CITY>
        <STATE>VA </STATE>
        <ZIP>20109 </ZIP>
    </ADDRESS>
</ADDRESS_BOOK>

Now that you know generally how you want your language to look, you need to precisely define the rules of a valid document by creating the document type definition. As stated earlier in this chapter, a DTD consists of element type declarations and attribute-list declarations. You start with your elements and describe the allowed sequence, choice, and occurrence of subelements.

  • An ADDRESS_BOOK must have one or more ADDRESS elements.

  • An ADDRESS must have a NAME, STREET, CITY, STATE, and ZIP element in that sequence.

  • An ADDRESS element may have multiple STREET tags.

  • NAME, STREET, CITY, STATE, and ZIP elements can only contain parsed character data.

Here are the element declarations for the DTD:

<!ELEMENT ADDRESS_BOOK (ADDRESS)+>
<!ELEMENT ADDRESS (NAME, STREET+, CITY, STATE, ZIP)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT STREET (#PCDATA)>
<!ELEMENT CITY (#PCDATA)>
<!ELEMENT STATE (#PCDATA)>
<!ELEMENT ZIP (#PCDATA)>
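
Before turning to a real validating parser, it can be instructive to check these four rules by hand. The Python sketch below is a hard-coded check written only for ABML; it does not read the DTD, and it is no substitute for a validating parser. The function name check_address_book and the sample documents are our own.

```python
import re
import xml.etree.ElementTree as ET

# (NAME, STREET+, CITY, STATE, ZIP) over child names joined with ";".
ADDRESS_MODEL = re.compile(r"NAME;(?:STREET;)+CITY;STATE;ZIP;")

GOOD = ("<ADDRESS_BOOK><ADDRESS><NAME>Sterling Software</NAME>"
        "<STREET>7900 Sudley Road</STREET><STREET>Suite 500</STREET>"
        "<CITY>Manassas</CITY><STATE>VA</STATE><ZIP>20109</ZIP>"
        "</ADDRESS></ADDRESS_BOOK>")
# Same data, but CITY appears before STREET.
BAD = ("<ADDRESS_BOOK><ADDRESS><NAME>Michael Daconta</NAME>"
       "<CITY>Bealeton</CITY><STREET>4296 Razor Hill Road</STREET>"
       "<STATE>VA</STATE><ZIP>22712</ZIP></ADDRESS></ADDRESS_BOOK>")


def check_address_book(doc_text):
    """Return a list of rule violations; an empty list means none were found."""
    root = ET.fromstring(doc_text)
    problems = []
    if root.tag != "ADDRESS_BOOK":
        problems.append("root element must be ADDRESS_BOOK")
    addresses = list(root)
    if not addresses or any(child.tag != "ADDRESS" for child in addresses):
        problems.append("ADDRESS_BOOK must contain one or more ADDRESS elements")
    for number, address in enumerate(addresses, start=1):
        if address.tag != "ADDRESS":
            continue
        children = "".join(child.tag + ";" for child in address)
        if not ADDRESS_MODEL.fullmatch(children):
            problems.append(
                f"ADDRESS {number}: children must be NAME, STREET+, CITY, STATE, ZIP")
        for field in address:
            if len(field) > 0:
                problems.append(
                    f"ADDRESS {number}: {field.tag} may only contain character data")
    return problems


if __name__ == "__main__":
    print(check_address_book(GOOD))
    print(check_address_book(BAD))
```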

Now, these elements were very simple and did not have any attributes; however, you would add an attribute to the STREET element by doing the following:

<!ATTLIST STREET TYPE (street|suiteno|aptno|other) #IMPLIED>

This would allow people to optionally add a TYPE attribute to the STREET element.

Now run the IBM validating parser on the ABML document to ensure it is both well-formed and valid. First cd to the sams.chp2 directory, where the file myaddresses.xml is located. Then type the following command:

> java sax.SAXWriter -p com.ibm.xml.parsers.ValidatingSAXParser myaddresses.xml

Running the program produces the output shown in Figure 1.2. If the document has no errors, the document is printed. If the document is not well-formed or does not match the DTD, errors are printed that detail the exact line where the problem occurs.

Figure 1.2. xml4j validating parser.

