Chapter 7. Understanding XML

In this hour, you will learn why XML is needed. You will begin with the rules of grammar that govern XML, and then see how XML namespaces prevent naming conflicts when several tag vocabularies are combined in one document. Finally, we will show you how you can be sure that the documents you receive are correct. In this hour, you will learn

  • Why XML is needed

  • The rules for creating an XML document

  • Avoiding naming conflicts

  • How to validate that the document is correct

Understanding Why We Need XML

Most knowledge workers like technology only when it solves a problem that is vexing them. Before launching into a discussion of the details, we should first pause and ask ourselves why we need XML anyway.

The idea of interconnecting two computers and exchanging data between them was born on the same day that the second computer was installed. We don’t often scatter our data around on purpose; it happens naturally. Through mergers, acquisitions, software platform obsolescence, and new technical innovations, we end up with islands of data all over our network. Whenever we go to assemble the data that we need to support decision making, we invariably need to combine data that doesn’t reside on the same computer.

The solution to this problem is to export the data from one computer system and import it into another. The simplicity of this statement hides a multitude of ugly details. Hardware engineers solved the problem of moving files between computers long ago. Software engineers have likewise solved the technical problems of converting EBCDIC encoding into ASCII and converting the byte order. The problem is not in getting a text file from one computer to another in a readable format; the problem is how to write programs that can figure out what the data means when you get it there.

Several years ago, a major defense contractor launched an effort to place the instructions for how to assemble its airplanes online. For 40 years, these instructions had been printed and bound into a book. The mechanics who built the planes took out the book, read it, and then did the work.

The justification for the new system was based on the fact that these instructions contained many cross-references to other documents. The requirements stated that these cross-references must be changed into hyperlinks in the online system. The names of the hyperlinked files existed on the printed form and in the word-processed version of the document. They were embedded, however, in the middle of all the specifications and instructions.

There was no indication that this was a cross-reference, even in the electronic version of the paper document. Sometimes these special document references would be on the fifth line in one document but on the seventh line of another. Sometimes they started with an x, but other times with an r. Some of them were numeric, but others also contained letters. Needless to say, determining which pieces of text represented instructions and which pieces represented references to other documents was a nontrivial problem. A typical instruction looked like this:

"Drill hole according to procedure x151-1 and insert rivet.  Seal the top
of the rivet using sealant 212, Material Safety Data Sheet MSDS-2324."

After much trial and error, a fairly sophisticated parser was written that achieved about a 98% accuracy rate. Imagine how much easier this task would have been if this document had been prepared in the following manner:

<instruction>
  Drill hole according to procedure
  <standardProcedure>x151-1</standardProcedure>
  and insert rivet
</instruction>
<instruction>
  Seal the top of the rivet using sealant 212, Material Safety Data Sheet
  <safetySheet>MSDS-2324</safetySheet>
</instruction>

The programmer could easily differentiate between instructions, standard procedures, and safety sheets if given a document that has been formatted in this way. This concept is at the heart of XML. An XML document is a document filled with tags telling the reader, either human or computer, what the data means.
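
To make this concrete, here is a minimal Python sketch of how a program might pull the cross-references out of instructions tagged in this way. It makes several assumptions that are not part of the original system: the tag names are written without spaces (standardProcedure and safetySheet), the two instructions are wrapped in a hypothetical <instructions> root element so that the text forms one complete document, and the function name find_references is invented for the example. The parser is the xml.etree.ElementTree module from the Python standard library.

```python
import xml.etree.ElementTree as ET

# The <instructions> root element is an assumption added for this sketch.
TAGGED = """
<instructions>
  <instruction>
    Drill hole according to procedure
    <standardProcedure>x151-1</standardProcedure>
    and insert rivet
  </instruction>
  <instruction>
    Seal the top of the rivet using sealant 212, Material Safety Data Sheet
    <safetySheet>MSDS-2324</safetySheet>
  </instruction>
</instructions>
"""

# Tags that mark a reference to another document (hypothetical names).
REFERENCE_TAGS = ("standardProcedure", "safetySheet")


def find_references(xml_text):
    """Return (tag name, reference) pairs in document order."""
    root = ET.fromstring(xml_text)
    references = []
    for element in root.iter():
        if element.tag in REFERENCE_TAGS:
            references.append((element.tag, element.text.strip()))
    return references


if __name__ == "__main__":
    for kind, reference in find_references(TAGGED):
        print(kind, reference)
```

No guessing about line positions or naming patterns is needed here; the tags alone identify which pieces of text are references.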

Using XML greatly simplifies the task of preparing documents for exchange between computer systems. The XML specification contains a set of syntax rules that are fixed and inviolable. These rules govern the format that tags must obey, the special characters that are allowed and what they mean, and the format of the document as a whole.

The specification does not contain the meanings of the tags that will be used. The vocabulary is created by the XML users according to their business needs. This is somewhat analogous to the English language. English grammar rules tell us that proper names are capitalized, questions end with a “?”, a space is needed between words, and so on. The grammar doesn’t tell us what the vocabulary consists of; that is the job of the dictionary. In fact, the dictionary doesn’t even define the full vocabulary; we are free to make up new words to fit our needs.

The tag <horse> can mean anything that you and I agree that it means. In fact, it can mean cow, if we so desire (and could tolerate the confusion that would ensue). As long as the programs that I write interpret the tag the same way that you expected them to, all is well.

The vocabulary of an XML document is based on an agreement between two or more parties. Automotive manufacturers can create a special set of tags that describes painting instructions. Two bakers can create a vocabulary to describe recipes and a grammar to describe the relationships between the tags in the vocabulary. These vocabularies can be thought of as mini-dictionaries. Once a vocabulary of tags is created, it can be published and used by any number of organizations to exchange data in an XML format.

The Components of XML

Now that we have discussed the basic purpose of XML, let’s look at the major components that compose it:

  • XML document—A file that obeys the rules of XML. It contains data and can be thought of as a data store or a mini-database. In addition, an XML document can be loaded into the memory of the computer by a program. While in memory, it can still be referred to as an XML document.

  • XML parser—A computer program that takes XML as its input and produces a program-readable representation of its contents. Processing data in XML format would be very inefficient, so the parser transforms it into data structures that are efficient to process.

  • Document Type Definition (DTD)—A description of the tags that are allowed to exist in a document and their relationships to each other. The DTD was largely superseded by the XML schema specification, published in 2001, although you still see DTDs in use.

  • XML schema—A description of the tags that are allowed to exist in a document and their relationship to each other. You validate the document against the XML schema to ensure that it contains tags that obey the rules set forth in the schema. This validation takes place outside of your programs by an XML parser, relieving you of the burden of writing this code in every system that you work on. The XML schema is a new and much improved version of the DTD.

  • Namespaces—A unique name that is used to avoid conflicts between tag names. Because an XML document can contain other XML documents, we must have a way to guarantee that none of the included document’s tags are identical to one of the main document’s tags. By creating a namespace for each XML document, included tags can always be distinguished from the tags of the original document.

All these pieces work together to support your applications. They allow you to create systems that transfer data easily, in documents that can be parsed and validated in a standard way.

The XML Grammar Rules

The purpose of covering the basics of XML grammar in this section is to permit you to understand the topic of Web services at a deeper level. The fact that you bought this book indicates that you are looking for more than a superficial understanding of how Web services work. To acquire this knowledge, it is critical that you be able to read XML files, even if you never plan on writing one yourself.

Here is the set of rules that an XML document must follow:

  • Each start tag must have a corresponding end tag.

  • Attribute values must be enclosed in quotes.

  • Some characters in data, such as < and &, must be represented by entity references (&lt; and &amp;). If they appear in text as ordinary characters, the XML parser treats them as the start of markup and reports an error.

  • Improperly nested tags are not permitted. If you start a tag sequence <a><b>, it must end </b></a>, not </a></b>.

  • The document should begin with the XML prologue: <?xml version='1.0'?>. XML 1.0 parsers accept a document without it, but including it is good practice.

Documents that follow these rules are considered “well formed.” If a document is not well formed, it will cause errors to be thrown during parsing.
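
The rules above can be tried out with a short Python sketch. It relies on the xml.etree.ElementTree parser and the xml.sax.saxutils.escape function from the Python standard library; the function name is_well_formed and the sample strings are made up for this example. The parser raises ParseError when a document is not well formed, and the sketch turns that into a simple True or False answer.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it raises an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    samples = {
        "properly nested": "<a><b>text</b></a>",
        "improperly nested": "<a><b>text</a></b>",
        "missing end tag": "<a><b>text</b>",
        "unquoted attribute value": "<a id=5>text</a>",
        "bare ampersand": "<a>fish & chips</a>",
        # escape() replaces &, <, and > with entity references.
        "escaped ampersand": "<a>" + escape("fish & chips") + "</a>",
    }
    for label, text in samples.items():
        print(label, is_well_formed(text))
```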

This set of rules is really pretty small when compared to the power of the XML technology. Listing 7.1 shows a sample XML document.

Example 7.1. The TicketRequest.XML File

<?xml version='1.0' encoding='utf-8' standalone='yes' ?>
<!--This XML document represents a request for a cruise ticket-->

<ticketRequest>
   <customer custID="10003" >
      <lastName>Carter</lastName>
      <firstName>Joseph</firstName>
   </customer>
   <cruise cruiseID="3004">
      <destination>Hawaii</destination>
      <port>Honolulu</port>
      <sailing>7/7/2001</sailing>
      <numberOfTickets>5</numberOfTickets>
      <isCommissionable/>
   </cruise>
</ticketRequest>

The first line is called the prologue:

<?xml version='1.0' encoding='utf-8' standalone='yes' ?>

The first entry contains the version of XML in which it was written. This can be important because future releases of XML might force parsers to be aware of the version. The encoding called utf-8 is an encoding of Unicode; it can represent the characters of virtually every written language, including the standard Western European character set. The standalone='yes' entry tells us that this document does not depend on an external DTD.

The second line is a comment:

<!--This XML document represents a request for a cruise ticket-->

The next line is the root element of the document. There is only one root element per XML document; all other elements in the XML document must be enclosed by the root element. The <ticketRequest> element tells us that this is a user-defined tag with the name ticketRequest. Because this document declares no namespace, ticketRequest does not belong to one. We will look at namespaces later in this hour.

<ticketRequest>

The <customer> start tag contains a name-value pair in addition to the tag name: custID="10003". custID is said to be an attribute of the tag customer. "10003" is called the attribute value.

   <customer custID="10003" >
      <lastName>Carter</lastName>
      <firstName>Joseph</firstName>
   </customer>

The <customer> tag contains two other tags: the <lastName> and the <firstName> tags. The values of these tags lie between each start tag and its end tag, not inside the delimiters < and >. In reality, the document designer can place data as attributes or as tag values whenever a one-to-one relationship exists. When the relationship is one-to-many, only tag values will work. Notice the use of the corresponding </customer> tag to indicate the end of the <customer> tag.
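
From a program's point of view, the two placements are read in slightly different ways. The following Python sketch, which uses the standard xml.etree.ElementTree module, reads the attribute with get() and the tag values with findtext(). The function name describe_customer is hypothetical.

```python
import xml.etree.ElementTree as ET

CUSTOMER = """
<customer custID="10003">
   <lastName>Carter</lastName>
   <firstName>Joseph</firstName>
</customer>
"""


def describe_customer(xml_text):
    """Return the customer's ID and names as a dictionary of strings."""
    customer = ET.fromstring(xml_text)
    return {
        "custID": customer.get("custID"),              # attribute value
        "firstName": customer.findtext("firstName"),   # text of a child tag
        "lastName": customer.findtext("lastName"),
    }


if __name__ == "__main__":
    print(describe_customer(CUSTOMER))
```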

The <cruise> tag follows a pattern similar to the <customer> tag. If an element doesn’t have any content, meaning neither text nor nested elements, you can use the empty tag shorthand notation instead of an opening and a closing tag. Instead of having an <isCommissionable> and then a </isCommissionable> tag, we have only <isCommissionable/>, which means the same thing as the two tags combined. This tag is special in that it can’t contain data. Its presence is sufficient to indicate that a sales commission will be paid to the agency that booked this cruise.

   <cruise cruiseID="3004">
      <destination>Hawaii</destination>
      <port>Honolulu</port>
      <sailing>7/7/2001</sailing>
      <numberOfTickets>5</numberOfTickets>
      <isCommissionable/>
   </cruise>
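
A program that receives this element only needs to test whether the empty tag is present. The Python sketch below, built on the standard xml.etree.ElementTree module, shows that the shorthand and the long form are treated alike by the parser. The function name is_commissionable and the three shortened sample strings are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

SHORTHAND = '<cruise cruiseID="3004"><isCommissionable/></cruise>'
LONG_FORM = '<cruise cruiseID="3004"><isCommissionable></isCommissionable></cruise>'
WITHOUT = '<cruise cruiseID="3004"></cruise>'


def is_commissionable(cruise_xml):
    """Return True if the cruise element has an isCommissionable child."""
    cruise = ET.fromstring(cruise_xml)
    # find() returns None when no matching child element exists.
    return cruise.find("isCommissionable") is not None


if __name__ == "__main__":
    for text in (SHORTHAND, LONG_FORM, WITHOUT):
        print(is_commissionable(text))
```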

The closing tag indicates that the document is complete:

</ticketRequest>

Notice how easy it is to understand what the data in the file means. The careful selection of tag names preserves their meaning so that humans, as well as software, can understand the data. Testing XML files is easy to do using either a Netscape or IE browser. All you have to do is create a file with the XML in it, and then open the file by using the Open command on the File menu. If you have any errors in the XML document, they will show up in the browser. Figure 7.1 shows what an error message looks like in Netscape 7.0.

Figure 7.1. You can use a browser to validate that an XML file is well formed.

We purposely misspelled one of the tags so that we would generate an error. You will also notice that there is no designation of a DTD or schema in this code. If there had been, a validating parser could have located it and used it to check that the tags in the document obey the rules that the DTD or schema defines.

Notice also that this XML document did not contain any data that is not plain text. The reason for this is to preserve the simplicity of the document. If XML permitted the inclusion of binary data into documents, it would greatly complicate the parsing process and compromise our ability to transfer it between different computers. Integers, real numbers, dates, and times can be created from text strings within programs. By the same token, programs can convert these data types into their textual representation before putting them into XML documents.
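
As a small illustration of these conversions, the Python sketch below turns the text in a shortened version of the cruise element into an integer and a date, and then turns a date back into text. It uses only the standard library. The function names read_cruise and write_sailing are hypothetical, and the sketch assumes that the sailing date is written as month/day/year.

```python
import xml.etree.ElementTree as ET
from datetime import date, datetime

CRUISE = """
<cruise cruiseID="3004">
   <destination>Hawaii</destination>
   <sailing>7/7/2001</sailing>
   <numberOfTickets>5</numberOfTickets>
</cruise>
"""


def read_cruise(xml_text):
    """Convert the text in the document into Python data types."""
    cruise = ET.fromstring(xml_text)
    cruise_id = int(cruise.get("cruiseID"))
    tickets = int(cruise.findtext("numberOfTickets"))
    # Assumption: the date is written month/day/year.
    sailing = datetime.strptime(cruise.findtext("sailing"), "%m/%d/%Y").date()
    return cruise_id, tickets, sailing


def write_sailing(sailing):
    """Convert a date back into text inside a <sailing> element."""
    element = ET.Element("sailing")
    element.text = "{}/{}/{}".format(sailing.month, sailing.day, sailing.year)
    return ET.tostring(element, encoding="unicode")


if __name__ == "__main__":
    print(read_cruise(CRUISE))
    print(write_sailing(date(2001, 7, 7)))
```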

Understanding Namespaces

Once you have created a vocabulary of useful elements, you will be reluctant to part with it. The principles of modularity state that you should be able to combine several different sets of XML tags and use them in the same document. A problem arises, however, if you try to combine tag vocabularies with element or attribute names that are identical. How will the program that receives your XML document differentiate between these different, but identically named elements?

A good solution to this problem would be to prefix every element with a string that is guaranteed to be unique across the whole planet. Using this scheme, two identical elements called <captain> could be differentiated because one of them would be called <abc:captain> and the other <xyz:captain>. Then the only problem would be to figure out a way to keep the authors of the other vocabularies that you use from using the same prefix.

If we were to use a valid URL from an organization that has registered it properly, we could be sure that no two organizations would use the same prefix. Therefore, the tag name of

     <www.samspublishing.com/authoring:captain>

would work. Because this publishing organization is large and others in the same company might use the same element name, it would be safer to add the name of my department to the string also. Now, I can be virtually guaranteed that no one outside my own department can create a name conflict. The only problem is the size of the tag. If I have a tag name that is this long, my document will be nearly unreadable. If I could create a string variable called wspa and assign to it the value www.samspublishing.com/authoring, my tag would look like this:

     <wspa:captain>

This name is much more practical. In fact, this approach is exactly the one employed by XML in a feature called the namespace. Consider the XML in Listing 7.2.

Example 7.2. The TicketRequest2.xml File

<?xml version='1.0' encoding='utf-8' standalone='yes' ?>
<!--This XML document represents a request for a cruise ticket-->

<cust:ticketRequest xmlns:cust='www.samspublishing.com/customer'
               xmlns:boat='www.samspublishing.com/boat'>
   <cust:customer custID="10003" >
      <cust:lastName>Carter</cust:lastName>
      <cust:firstName>Joseph</cust:firstName>
   </cust:customer>
   <boat:cruise cruiseID="3004">
      <boat:destination>Hawaii</boat:destination>
      <boat:port>Honolulu</boat:port>
      <boat:sailing>7/7/2001</boat:sailing>
      <boat:numberOfTickets>5</boat:numberOfTickets>
      <boat:isCommissionable/>
   </boat:cruise>
</cust:ticketRequest>

We defined two namespaces and bound them to the prefixes cust and boat. The xmlns string is reserved in XML; it signifies that a namespace is being declared. Using these two prefixes, we can guarantee uniqueness even if this document is combined with another. The reason for this is that the parser substitutes the long name for the short prefix whenever the document is processed. The prefix is purely for humans to look at. In fact, the prefix is local to this document.
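
You can watch this substitution happen with a short Python sketch. The standard xml.etree.ElementTree parser reports each tag as the long namespace string in braces followed by the tag name, so the prefix never reaches the program. The two shortened sample documents and the function name expanded_names are assumptions made for this example; the second document uses the arbitrary prefixes c and b for the same two namespaces.

```python
import xml.etree.ElementTree as ET

DOC_A = """
<cust:ticketRequest xmlns:cust='www.samspublishing.com/customer'
                    xmlns:boat='www.samspublishing.com/boat'>
   <cust:customer custID="10003"/>
   <boat:cruise cruiseID="3004"/>
</cust:ticketRequest>
"""

# The same namespaces bound to different, arbitrary prefixes.
DOC_B = """
<c:ticketRequest xmlns:c='www.samspublishing.com/customer'
                 xmlns:b='www.samspublishing.com/boat'>
   <c:customer custID="10003"/>
   <b:cruise cruiseID="3004"/>
</c:ticketRequest>
"""


def expanded_names(xml_text):
    """Return the tag of every element exactly as the parser reports it."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


if __name__ == "__main__":
    for name in expanded_names(DOC_A):
        print(name)
    # The prefixes differ, but the expanded names are identical.
    print(expanded_names(DOC_A) == expanded_names(DOC_B))
```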

Note

Don’t be confused by the use of a URL in the definition of the namespace. Any string can be substituted for this string, but the more unique it is, the better. URLs are the ultimate in unique strings. The parser doesn’t even look at the Web site represented by the URL, even if one actually exists, when processing the document. The unique string is the goal, not a valid address on the Internet.

Understanding the XML Schema

Now that we have examined the topic of namespace definition, we can look at how we can create an XML schema for a document. The DTD has two notable weaknesses: it does not allow us to specify data types well, and it is not written in XML. For these reasons, the W3C has released a new way to specify the legal contents of an XML document called the XML schema.

An XML schema is an XML file that performs the same function as a DTD, but the schema does it better. XML schemas allow you to specify not only the elements and attributes, but also the range of values and the data type of an element. Listing 7.3 shows a schema for a ticket request.

Example 7.3. The TicketRequest.xsd Schema File

<?xml version='1.0' encoding='utf-8' ?>

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
               xmlns:cruise="http://www.samspublishing.com/"
               targetNamespace="http://www.samspublishing.com/">

<xsd:annotation>
   <xsd:documentation xml:lang="en">
    This XML Schema document represents
     a request for a cruise ticket
   </xsd:documentation>
</xsd:annotation>

<xsd:element name="cruiseTicket" type="cruise:CruiseTicketType"/>

<xsd:complexType name="CruiseTicketType">
   <xsd:sequence>
      <xsd:element name="customer" type="cruise:CustomerType"/>
      <xsd:element name="cruise" type="cruise:CruiseType"/>
   </xsd:sequence>
</xsd:complexType>

<xsd:complexType name="CustomerType">
   <xsd:sequence>
      <xsd:element name="lastName" type="xsd:string"/>
      <xsd:element name="firstName" type="xsd:string"/>
   </xsd:sequence>
   <xsd:attribute name="custID" type="xsd:positiveInteger"/>
</xsd:complexType>

<xsd:complexType name="CruiseType">
   <xsd:sequence>
      <xsd:element name="destination" type="xsd:string"/>
      <xsd:element name="port" type="xsd:string"/>
      <xsd:element name="sailing" type="xsd:date"/>
      <xsd:element name="numberOfTickets" type="xsd:positiveInteger"/>
   </xsd:sequence>
   <xsd:attribute name="cruiseID" type="xsd:positiveInteger"/>
</xsd:complexType>

</xsd:schema>

The first thing that you notice about a schema file is that it is a regular well-formed XML file. The tags in the file have a prefix of xsd, which means that they belong to the XML schema namespace:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"

We define our own namespace called cruise:

                xmlns:cruise="http://www.samspublishing.com/"

We also define a targetNamespace, which is the namespace that must be used in an XML file if it is going to refer to this schema. This is the namespace that will appear in the xmlns: attribute on the root element of that file:

                 targetNamespace="http://www.samspublishing.com/"

The annotation and documentation allow comments to be placed in the schema to help the reader understand it:

<xsd:annotation>
   <xsd:documentation xml:lang="en">

The basic job of the schema is to define types, and then define elements of the new type that can appear in XML documents:

<xsd:element name="cruiseTicket" type="cruise:CruiseTicketType"/>

The types are defined in this file also. The complex types are those that contain other complex and simple types:

<xsd:complexType name="CruiseTicketType">

The sequence tag indicates that the order of the elements in the complexType must be followed:

   <xsd:sequence>

The CruiseTicketType is composed of two other complex types:

      <xsd:element name="customer" type="cruise:CustomerType"/>
      <xsd:element name="cruise" type="cruise:CruiseType"/>

The CustomerType and the CruiseType are complex types, but they are made up entirely of simple types. Notice that these simple types are of a variety of different data types. Notice also that the attributes are defined alongside the elements:

<xsd:complexType name="CustomerType">
   <xsd:sequence>
      <xsd:element name="lastName" type="xsd:string"/>
      <xsd:element name="firstName" type="xsd:string"/>
   </xsd:sequence>
   <xsd:attribute name="custID" type="xsd:positiveInteger"/>
</xsd:complexType>

<xsd:complexType name="CruiseType">
   <xsd:sequence>
      <xsd:element name="destination" type="xsd:string"/>
      <xsd:element name="port" type="xsd:string"/>
      <xsd:element name="sailing" type="xsd:date"/>
      <xsd:element name="numberOfTickets" type="xsd:positiveInteger"/>
   </xsd:sequence>
   <xsd:attribute name="cruiseID" type="xsd:positiveInteger"/>
</xsd:complexType>

Listing 7.4 shows an XML file that conforms to this schema.

Example 7.4. The TicketRequest3.xml File

<?xml version='1.0' encoding='utf-8'?>
<acruise:cruiseTicket xmlns:acruise="http://www.samspublishing.com/"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://www.samspublishing.com/ TicketRequest.xsd">

   <customer custID="10003" >
      <lastName>Carter</lastName>
      <firstName>Joseph</firstName>
   </customer>
   <cruise cruiseID="3004">
      <destination>Hawaii</destination>
      <port>Honolulu</port>
      <sailing>2001-07-07</sailing>
      <numberOfTickets>6</numberOfTickets>
   </cruise>
</acruise:cruiseTicket>

We first define a namespace and a prefix to identify the elements. Notice that we use the targetNamespace that was defined in the schema file, where the cruiseTicket element is declared:

<acruise:cruiseTicket xmlns:acruise="http://www.samspublishing.com/"

We next declare the xsi prefix for the XML schema instance namespace. This namespace supplies the attributes, such as schemaLocation, that connect a document to its schema:

  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

We associate the namespace string with a schema filename by writing the two as a pair separated by a space. The namespace string must match the one in the targetNamespace attribute in the schema file:

  xsi:schemaLocation="http://www.samspublishing.com/ TicketRequest.xsd">

We now create elements according to the complex type definitions and data types described in the schema file:

   <customer custID="10003" >
      <lastName>Carter</lastName>
      <firstName>Joseph</firstName>
   </customer>
   <cruise cruiseID="3004">
      <destination>Hawaii</destination>
      <port>Honolulu</port>
      <sailing>2001-07-07</sailing>
      <numberOfTickets>6</numberOfTickets>
   </cruise>

If we submit this file to a validating parser, it will check that every rule of XML and schema conformity is followed.
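
Full schema validation needs a validating parser, which the Python standard library does not include. As a rough illustration of the kind of rule such a parser enforces, the sketch below hand-codes a few of the checks implied by the schema in Listing 7.3: the root element and its namespace, the order of the customer and cruise elements, the date format of sailing, and the positive integer in numberOfTickets. The function name check_ticket and the sample documents are hypothetical, the date check is only approximate, and the code is no substitute for a real validating parser.

```python
import xml.etree.ElementTree as ET
from datetime import date

TARGET_NS = "http://www.samspublishing.com/"

GOOD = """
<acruise:cruiseTicket xmlns:acruise="http://www.samspublishing.com/">
   <customer custID="10003">
      <lastName>Carter</lastName>
      <firstName>Joseph</firstName>
   </customer>
   <cruise cruiseID="3004">
      <destination>Hawaii</destination>
      <port>Honolulu</port>
      <sailing>2001-07-07</sailing>
      <numberOfTickets>6</numberOfTickets>
   </cruise>
</acruise:cruiseTicket>
"""

# A copy with a date in the wrong format and a ticket count of zero.
BAD = GOOD.replace("2001-07-07", "7/7/2001").replace(">6<", ">0<")


def check_ticket(xml_text):
    """Return a list of problems; an empty list means every check passed."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "{" + TARGET_NS + "}cruiseTicket":
        problems.append("root must be cruiseTicket in the target namespace")
    # xsd:sequence: customer must come first, followed by cruise.
    if [child.tag for child in root] != ["customer", "cruise"]:
        problems.append("expected customer followed by cruise")
    sailing = root.findtext("cruise/sailing", default="").strip()
    try:
        date.fromisoformat(sailing)  # rough stand-in for the xsd:date check
    except ValueError:
        problems.append("sailing is not a valid date: " + repr(sailing))
    tickets = root.findtext("cruise/numberOfTickets", default="").strip()
    if not (tickets.isdigit() and int(tickets) > 0):
        problems.append("numberOfTickets is not a positive integer")
    return problems


if __name__ == "__main__":
    print(check_ticket(GOOD))
    print(check_ticket(BAD))
```

Writing checks like these by hand in every program is exactly the burden that a schema and a validating parser remove.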

Summary

This hour has introduced you to the basic concepts behind XML. You first looked at the motivation for using XML. Following that, you learned the grammar rules for creating XML documents.

Next, you learned how to use namespaces to avoid element-naming conflicts. In the final section, you learned how to validate the correctness of an XML document using XML schemas.

Q&A

Q. What was the primary motivation behind the creation of XML?

A. The primary goal was to create a way to transfer data in character form along with information about its meaning.

Q. Why is an XML schema considered better than a DTD?

A. An XML schema allows the creator to specify exactly what kind of data can appear in the document. A DTD is more limited in this area.

Q. Why is XML limited to text?

A. Every brand of computer can exchange text files with every other brand using software that is commonly available. Other data formats are not always easy to transfer and require special software.

Workshop

The following questions and activities will allow you to test your understanding of XML, namespaces, and XML schemas.

Quiz

1. What is the purpose of XML?

2. Why is an XML schema considered a superior way to validate a document?

3. What is the use of the URL in the definition of a namespace?

Quiz Answers

1. It allows the meaning of data to be communicated in the same document with the data.

2. An XML schema can validate an XML document as well as a DTD can; plus, it can assign more specific data types to each field. In addition, it is an XML document itself.

3. The URL that you normally see is really just a unique string. URLs are chosen because they are guaranteed to be unique. The XML parser doesn’t actually access the URL.

Activities

1. Create a simple XML schema that describes a business entity in your organization.

2. Create an XML document that conforms to your schema.
