Chapter 2. XML Syntaxes

So far, we have depicted simple XML documents, but we have not seen how to create them or how to display their contents. The answer is simple. We know that, in its simplest form, a Web browser (such as Microsoft Internet Explorer, Mozilla Firefox, or Netscape Navigator) is an HTML interpreter. In other words, given an HTML file, a Web browser can understand its contents and display them on the screen. Modern versions of these browsers are XML interpreters as well. That is, a Web browser can display an XML document just as it would display an HTML file.

Regarding the creation of an XML document, the answer is again quite straightforward. An XML document is a plain text file. Therefore, any text editor can be used to create an XML document. Regardless of the editor used, we should save the file with the .xml extension. The simplest editors on Windows are Notepad and WordPad. For other operating systems, the reader needs to consult the appropriate documentation.

There are more sophisticated editors as well. Perhaps the simplest of them is TextPad. This is available as a free download from the Internet. However, this is still a general text editor. Probably the world's most popular XML editor is Altova XMLSpy. This is also available as a free download over the Internet. However, our recommendation is not to use a sophisticated editor such as XMLSpy to start with. The best way to learn anything is to try things out ourselves, make mistakes in the process, and correct them. Specialised tools tend to take away some of this struggle by preventing these mistakes in the first place. This is quite acceptable when we are in a hurry to develop something for practical use. For a learner, however, the more hands-on work one does, the better.

Therefore, let us create an XML document using the humble Notepad editor. For this, start the Notepad editor on your computer and type in the text as shown in Figure 2.1.

Figure 2.1 Creating an XML document in Notepad

Now save this file with the name books.xml. After that, open the Web browser and try to open the books.xml document in the browser as shown in Figure 2.2.

Figure 2.2 Result of trying to open the XML document in the Web browser

As we can see, there is an error. This error is in the line:

<?xml-stylesheet type="text/xsl" href="books_list.xsl"?>

More specifically, we are asking the Web browser to find a file by the name books_list.xsl, and use it. The reason behind using this file will become clear later. For now, we need not worry about it. However, what should concern us is how to remove this error. In this case, let us not waste too much time, and simply remove the erroneous line from our XML document. Hence, our modified XML document in Notepad should look as shown in Figure 2.3.

Figure 2.3 Modified XML document

Now try opening the modified XML document in the Web browser. The result is shown in Figure 2.4.

Figure 2.4 Opening an XML document in a Web browser successfully

We will notice the following characteristics of this display:

The display does not have specific formatting, unlike HTML. All the XML elements seem to get displayed in a similar manner.
The hierarchy of data items is preserved. For example, a hyphen symbol is provided to the left of all the start element tags. We can expand the details of an element or compress them, by clicking on this hyphen. Figure 2.5 shows an example.

Figure 2.5 Hiding the details of an element

Notice that the hyphen next to the second BOOK element is now changed to a + symbol. This is because we had clicked on the hyphen to indicate that we wanted to hide or compress information about this element. If we click on the + symbol again, we would see an expanded version of this element, as shown in Figure 2.6.

Figure 2.6 Expanding the hidden details of an element

Just for the sake of completeness, let us also see how Microsoft's Internet Explorer browser displays our XML document, as shown in Figure 2.7.

Figure 2.7 Displaying an XML document in Microsoft Internet Explorer browser

As we can notice, the display of the XML document is similar in all browsers.

Let us also try out another situation. Let us deliberately change the end tag of the third BOOK element from </BOOK> to </BOO>, so that it no longer matches its start tag. This is just one example of introducing an error in our XML document. Any other such error will also do. Figures 2.8 and 2.9 show the resulting output errors in the Firefox and Internet Explorer browsers, respectively.

Figure 2.8 Error in XML—Firefox browser

Figure 2.9 Error in XML—Internet explorer browser

This should tell us clearly that Web browsers check the correctness of XML documents before they display them. If an error is found, the browser does not ignore it and proceed. Instead, it reports the error and stops, leaving the user to correct it. We shall study later that this checking comes in two forms (well-formedness and validity). For now, we will not worry about it beyond this. However, for understanding the basics better, let us have a few simple exercises.

Exercise 1: Try creating an XML document without the first line (that is, the XML declaration). What is the result?

Solution 1: We might expect this to cause an error. However, the browsers display the XML document correctly, because the XML declaration is optional in XML 1.0. We should, nevertheless, always include the XML declaration.

Exercise 2: Try having two root elements in an XML document. For example, add one more BOOKS root tag immediately after the original BOOKS root tag to our example, and also the corresponding end tag at the end, that is, </BOOKS>. Study what happens.

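Solution 2: There is no problem here either. Because the second <BOOKS> start tag appears inside the first one, and its end tag appears before the final </BOOKS>, the outer <BOOKS> element is considered the root element and the inner <BOOKS> element is considered a sub-element of this root element. The document therefore still has exactly one root element. Had we placed the second <BOOKS> element after the end tag of the first one, the document would have had two root elements, which is an error.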
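The checks that the browser performs can also be tried out without a browser. The following is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module. The helper name is_well_formed and the sample documents are our own illustrations, not the books.xml file shown in the figures.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents; the book title is a placeholder.
samples = {
    "correct document": '<?xml version="1.0"?><BOOKS><BOOK>Title one</BOOK></BOOKS>',
    "mismatched end tag": '<?xml version="1.0"?><BOOKS><BOOK>Title one</BOO></BOOKS>',
    "no XML declaration": "<BOOKS><BOOK>Title one</BOOK></BOOKS>",
    "BOOKS nested in BOOKS": "<BOOKS><BOOKS><BOOK>Title one</BOOK></BOOKS></BOOKS>",
    "two sibling root elements": "<BOOKS></BOOKS><BOOKS></BOOKS>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, accepted in results.items():
    print(name, "->", "accepted" if accepted else "rejected")
```

Only the mismatched end tag and the two sibling root elements are rejected. The missing declaration and the nested BOOKS element are accepted, which agrees with the two exercises above.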

2.2 THE IDEA OF MARKUP

We have seen quite a few XML examples thus far. We would have noticed that in all these examples, there are tags containing values. These tags specify certain rules. Therefore, we can also say that XML is nothing but a set of rules. These rules specify how the contents of an XML document can be broken down into parts and sub-parts.

Another name for tags is markup. Hence, we also call XML a markup language.

HTML is also a markup language. HTML also specifies tags that allow us to classify and sub-classify text. However, as we have seen, the purposes of XML and HTML are different. Whereas HTML aims at classifying text for display purposes, XML is used for classifying text for information storage, classification, and retrieval purposes.

XML has been written in such a manner that it can be extended easily, depending on the business domain, particular sets of requirements, or technology. Hence, together we have the name Extensible Markup Language.

XML is based on yet another language, called Standard Generalised Markup Language (SGML). HTML is also based on SGML. Thus, we can draw a tree as shown in Figure 2.10, keeping in mind that this is only a portion of the tree.

SGML is the parent of almost all important modern markup languages. However, it does not itself have too much of value as a markup language. This means that we cannot use SGML for creating content easily, as we would do by using HTML or XML. However, SGML works best when we want to create another language based on it, which would create the actual content. Thus, HTML and XML are good at creating content; and are based on SGML. However, we would rarely, if at all, see content created in SGML itself.

Figure 2.10 Positioning of SGML, HTML, and XML

What does the markup portion of XML mean to us, then? It brings the following features of value:

Easy to read for humans
Easy to use
Easy to use for computers
Easy to debug
Easy to modify suitably for any industry or domain
Works with all leading programming languages, databases, and formats such as spreadsheets and drawings

We have mentioned that XML is based on SGML.

SGML was created to provide a means for identifying the portions and content of a document, not by line numbers or the actual content, but by the type of the content.

Since XML is a child of SGML, it serves the same purpose with a lot more focus. That is, we use XML for identifying portions of a document. An important aspect here is that this identification of the portions of a document is not based on things such as line numbers, or the actual content of the document. Instead, it depends on the type of the content. This is a key concept in understanding XML. As an example of this point, we can search an SGML (and hence, XML) document for all the instances of a tag such as, say, BOOK, and replace it with something else. In another case, we can look for all the H2 tags in an SGML/XML document, and specify that the contents of this tag should be displayed in a specified font, colour, and font size.

Well, the story does not end here! SGML, in turn, is based on yet another language called Generalised Markup Language (GML). GML was developed at IBM in the 1960s. SGML was not popular, and was sparingly used, until it became an ISO standard in 1986. The US Internal Revenue Service (IRS) and the US Department of Defense (DOD) accepted SGML as their official language and demanded that all their vendors and contractors use SGML. This gave a tremendous fillip to SGML. Soon, the Internet evolved and became hugely popular. The creators of the Web thought about using SGML as the base for HTML, a step that was to become a success. Therefore, when XML was evolving, SGML was a natural choice for its development.

We can say that SGML is a language of languages.

In other words, these days, we would not use SGML as the language for creating or exchanging documents. Instead, we would use SGML to create other languages, such as HTML or XML, which are quite good at creating or exchanging documents.

2.3 XML STRUCTURE

2.3.1 Inverted Tree Structure

It is extremely important to understand how we can create XML documents in terms of their internal structure. That is, we must have a good grasp of the way we can organise the contents of an XML document. We need to remember that XML organises information in a hierarchical manner. For example, the book that you are reading now has also organised information as a hierarchy of textual contents. In this case, the hierarchy consists of chapters, each of which consists of sections, each of which consists of paragraphs, each of which consists of sentences, and so on. We can organise this information diagrammatically as shown in Figure 2.11.

Figure 2.11 Concept of hierarchy of information

We should be able to break down or transform information into such a hierarchy when we want to work with XML. Given any form of text, organised in any manner, we should be able to break it down into a hierarchy as shown.

An XML document is actually similar to an inverted tree.

For example, we can represent the hierarchy of the contents of this book as shown in Figure 2.12. Due to space constraints, we are showing only a few samples. But that should not cause problems in the understanding of the concept. We will soon compare this with the way XML would look at these contents, to verify that it views any sort of content as an inverted tree structure.

Figure 2.12 Book contents as a hierarchy (inverted tree)

Let us now see how XML looks at the same structure, as shown in Figure 2.13.

Figure 2.13 Book contents as an XML document

We can see that the XML document organises information in a manner that is remarkably similar to that of the inverted tree structure shown earlier. This should give us a good idea that XML is good at representing information as a hierarchy of elements.
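To see the inverted tree for ourselves, we can ask a parser to walk an XML document and print one line per element, indented by its depth. The following is a small sketch, assuming Python's standard xml.etree.ElementTree module. The element names (BOOK, CHAPTER, SECTION, PARAGRAPH) and the function name outline are assumptions made for illustration; they are not the contents of Figure 2.13.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a book's structure.
doc = """<BOOK>
  <CHAPTER>
    <SECTION>
      <PARAGRAPH>First paragraph</PARAGRAPH>
      <PARAGRAPH>Second paragraph</PARAGRAPH>
    </SECTION>
  </CHAPTER>
  <CHAPTER>
    <SECTION>
      <PARAGRAPH>Another paragraph</PARAGRAPH>
    </SECTION>
  </CHAPTER>
</BOOK>"""


def outline(element, depth=0):
    """Return one indented line per element, with the root first."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(doc))
print("\n".join(tree_lines))
```

The root element appears at the top with no indentation, and every level of nesting moves one step to the right, which is the same picture as a tree drawn with its root at the top.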

2.3.2 Creating Layers of Elements

XML elements can contain data, other sub-elements, or nothing. Examples of this are shown in Figure 2.14.

Figure 2.14 Possibilities about the contents of an element in an XML document

This also means that the same content can be represented differently in different XML documents. For example, two ways to represent the same content in an XML element are shown in Figure 2.15.

Figure 2.15 Layering of elements

We can see that without a layering of elements, there is no provision for us to directly refer to individual elements (as seen in the first part of the diagram). There is just one element (named BOOK in this XML document). The second diagram identifies the Title and Author elements individually. This ability of addressing individual elements in an XML document really gives the power that XML claims to have. In other words, we should use layering of elements and sub-elements to our advantage. We have to also ensure, though, that we are not overdoing it.

In a way, this is similar to the discussion about database normalisation. We know that an un-normalised database is not the easiest to work with. Similarly, if we do not break down the contents of an XML document into elements, sub-elements, and attributes in a proper manner, it may become difficult to work with that XML document.

Interestingly, this is not the only way to ensure accessibility of individual portions of an XML document. We could have done away with the TITLE element, and retained the ability to access the details of the XML document. This is shown in Figure 2.16.

Figure 2.16 Removing the TITLE element

In this case, although we have removed the TITLE element from the book information, we have not lost the contents of the title itself. Now, the BOOK element contains text (representing the title of the book) and another element, called AUTHOR. If we want to search for the title of the book, we will need to look at the text content of the BOOK element now. We should ignore the child elements of the BOOK element.

The point is that XML does not mandate that we use a specific layering or organising strategy for elements. We are free to choose our design, keeping in mind our requirements. It all boils down to how much flexibility we want, and what sort of challenges we want to resolve.
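The effect of these three designs on the code that reads the document can be sketched as follows. This is an illustration only, assuming Python's standard xml.etree.ElementTree module. The title and author values are placeholders, and the variable names flat, layered, and mixed are ours.

```python
import xml.etree.ElementTree as ET

# Hypothetical book data used only for illustration.
flat = ET.fromstring("<BOOK>Sample Title by Sample Author</BOOK>")
layered = ET.fromstring(
    "<BOOK><TITLE>Sample Title</TITLE><AUTHOR>Sample Author</AUTHOR></BOOK>"
)
mixed = ET.fromstring("<BOOK>Sample Title<AUTHOR>Sample Author</AUTHOR></BOOK>")

# Without layering, there is only one string, which we would have to split.
print(flat.text)

# With layering, each item can be addressed by its element name.
print(layered.findtext("TITLE"), "|", layered.findtext("AUTHOR"))

# Without a TITLE element, the title is the text of BOOK itself
# (the text that precedes its first child element).
print(mixed.text, "|", mixed.findtext("AUTHOR"))
```

The third form works, but the reader of the document has to know that the title is the text of BOOK that comes before its first child element. This is the kind of trade-off we need to weigh when choosing a design.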

2.3.3 Comments in XML

We must be familiar with the reasons why we need to have good comments in our code and documents.

The syntax for comments in XML is exactly the same as for HTML. That is, a comment starts with <!-- and ends with -->.

Consider the example shown in Figure 2.17.

Figure 2.17 Comment in XML

The output of this XML document in the Web browser is shown in Figure 2.18. Notice that the comment portion is ignored, that is, treated differently, by the Web browser.

Figure 2.18 XML comments are ignored

Here, the comment purely describes something that we want to state. Comments have another purpose, too. They can be used to temporarily hide or conceptually disable the portions of an XML document that are not in use currently. For example, let us imagine that we are debugging the contents of an XML document with reference to some problem. We can comment out the portions of the XML document that are not needed in the current context. An example of this situation is depicted in Figure 2.19.

Figure 2.19 Use of comments to temporarily hide information

Certain restrictions apply when using comments. They are as follows.

XML comments cannot appear before the xml tag. In other words, when the xml tag is present, it must be the first statement for an XML document to be well-formed. Figure 2.20 shows an example of what is not allowed.

Figure 2.20 Comments cannot appear before the xml tag

The error displayed by the Web browser is shown in Figure 2.21.

Figure 2.21 Error encountered by adding a comment before the xml tag

Comments in XML cannot break an element. In other words, an XML comment should not cause an element to have a start tag but no corresponding end tag, or an end tag for which there was no start tag. An example of this error is shown in Figure 2.22.

Figure 2.22 Comments cannot break elements

The resulting error screen is shown in Figure 2.23.

Figure 2.23 Result of a comment breaking elements

A comment cannot be inside an XML element declaration, that is, between the < and > symbols of a tag. Figure 2.24 shows an example of this type of problem.

Figure 2.24 A comment cannot appear inside an element declaration

The resulting error screen is shown in Figure 2.25.

Figure 2.25 Problem caused by adding a comment inside an element declaration

Finally, comments cannot be nested. An example of this situation is shown in Figure 2.26.

Figure 2.26 Nesting of comments is prohibited

The resulting error is shown in Figure 2.27.

Figure 2.27 Error when nested comments are found in an XML document
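The four restrictions on comments can be checked with a parser as well. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module. The element names A and B, the helper name parses, and the sample texts are hypothetical and do not reproduce Figures 2.20 to 2.26.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical documents, one per situation discussed above.
cases = {
    "comment after the xml tag": '<?xml version="1.0"?><!-- fine --><A><B>1</B></A>',
    "comment before the xml tag": '<!-- early --><?xml version="1.0"?><A><B>1</B></A>',
    "comment swallowing an end tag": "<A><B>1<!-- </B> --></A>",
    "comment inside a tag": "<A><B <!-- note --> >1</B></A>",
    "nested comment": "<A><!-- outer <!-- inner --> still outer --><B>1</B></A>",
}

results = {name: parses(text) for name, text in cases.items()}
for name, accepted in results.items():
    print(name, "->", "accepted" if accepted else "rejected")
```

Only the first case is accepted. The exact wording of the error differs from one parser or browser to another, so the sketch only records whether the document was accepted.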

2.3.4 Order of Elements

Relational Database Management Systems (RDBMS) do not attach significance to the order of rows in a table. Data can be present in a table in a sorted or unsorted form. That is not relevant to the designer, programmer or user of the database. At run-time, we can control the ordering of rows by using the ORDER BY clause. For faster access, indexes can be created to make ordering of data simpler. This concept is shown in Figure 2.28.

Figure 2.28 The order or sequence of data in an RDBMS table does not matter

The same concept applies to XML documents.

Ordering or sequencing of elements has no relevance in XML documents.

XML elements can be in any order. They need not be stored in a particular sequence, unless a DTD or schema prescribes one. The ordering is not on the basis of the contents of elements, attributes, or anything else. Whenever we need to display the contents of an XML document in a particular sequence, we can use the Extensible Stylesheet Language (XSL) to do so. We will simply introduce this concept here, and elaborate on it when we discuss XSL later in detail. This is shown in Figure 2.29.

Figure 2.29 Order of data in an XML document does not matter

As we can see, there is no sequence inherent in the XML document. However, we can easily sort and group the contents on the basis of the EMP_DEPT element and present it in a format as shown in the figure. This trick can be accomplished by XSL. The original organisation of the XML document in terms of its actual contents does not matter.
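XSL is discussed later, so here we only illustrate the idea of sorting and grouping at the time of presentation, using plain Python instead of XSL. This is a sketch under stated assumptions: apart from EMP_DEPT, the element names (EMPLOYEES, EMPLOYEE, EMP_NAME) and all the data values are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Hypothetical employee data, deliberately stored in no particular order.
doc = """<EMPLOYEES>
  <EMPLOYEE><EMP_NAME>Employee C</EMP_NAME><EMP_DEPT>Sales</EMP_DEPT></EMPLOYEE>
  <EMPLOYEE><EMP_NAME>Employee A</EMP_NAME><EMP_DEPT>Accounts</EMP_DEPT></EMPLOYEE>
  <EMPLOYEE><EMP_NAME>Employee B</EMP_NAME><EMP_DEPT>Sales</EMP_DEPT></EMPLOYEE>
</EMPLOYEES>"""

root = ET.fromstring(doc)


def department(employee):
    """Sort and group key: the text of the EMP_DEPT sub-element."""
    return employee.findtext("EMP_DEPT")


# Sort a copy of the list for display; the document itself is not changed.
by_department = sorted(root.findall("EMPLOYEE"), key=department)

grouped = {
    dept: [employee.findtext("EMP_NAME") for employee in members]
    for dept, members in groupby(by_department, key=department)
}
stored_order = [employee.findtext("EMP_NAME") for employee in root]

for dept, names in grouped.items():
    print(dept, names)
```

The grouped view is built only when we present the data. The stored order of the EMPLOYEE elements in the document remains exactly as it was typed.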

2.4 ORGANISING INFORMATION IN XML

Designing an XML document is similar to designing a database table. We think about things such as normalisation, redundancy, etc., while designing a database table. While designing an XML document, we do not speak in these terms, but the ideas are fairly similar. We can break down this process of organising information into an XML format in three steps, as shown in Figure 2.30.

Figure 2.30 Steps in organising information into XML format

Let us discuss these steps now.

2.4.1 Classifying Information as per its Importance

In any application or a given situation, we will always find that some data is more important than the rest (taking a clue from George Orwell's famous “All animals are equal, but some animals are more equal than others”!). For example, in a banking application, the data about account holders and their transactions is perhaps the most crucial piece of data. In an inventory management application, the information about products and their current stock details is very important. It is needless to say that we are not questioning the importance of other pieces of information in these applications. It is just that those are not the main aspects of that application.

Using these ideas as the basic clues, let us identify the first steps in transforming some piece of information into an XML format. For this purpose, perhaps the simplest trick is to classify information into various categories, such as primary, secondary, and tertiary. For instance, information that we think is the most crucial piece should be classified as primary. The other two categories would contain information that is not the most significant, but is yet relevant, in a decreasing order of importance. This is best illustrated with an example.

Consider that we need to represent information about books in an XML document. We may like to track information about a number of aspects of a book. These are listed in Figure 2.31.

Figure 2.31 Information about a book that we would like to capture

Let us now think about the importance of the various aspects of a book, from the list drawn above. Of course, some of the logic that is applied here while deciding the importance, and hence classifying the details, may vary from one situation to the other. However, we are generalising those possibilities to come to a common conclusion. Based on this theory, we could come up with the classification of book details as shown in Table 2.1.

Primary information	Secondary information	Tertiary information
Title	Publishing year	Reprint number
Author	Number of pages	Number of illustrations
Publication	Foreword by	Bookstores where the book is available
Price	Image of the book cover	Online bookstores where the book is available
Edition	Long description	Other books by the same author
Book Website	Reviews	Other books on the same subject
Short description		Any competing titles

Table 2.1 Classifying information as per its perceived importance

2.4.2 Adding the Details

Now that we have a fair idea of what we want to store in terms of book information, the next step is to add more details to the current understanding. These details will again vary from one situation to another. However, the common details need to be captured and depicted here. This exercise, like the previous one, may seem to be similar to the steps in designing a database. Even there, we decide what information needs to be captured, and how, step-by-step.

For instance, what information would we be interested in capturing as a part of the author's name? At the minimum, it could be the first and the last names; and if we want to be really detailed, it would also include the middle name. Similarly, the publishing company's details could include just the name, at the very least, and its complete address at the other extreme. We need to know these details, because that would determine exactly how our XML is going to be designed.

We can conduct this exercise for all the three categories of information (that is, primary, secondary, and tertiary). However, that may be unnecessary in many situations. Instead, it may be quite sufficient to restrict this to the primary information. As a result, our expanded table for primary information would look as shown in Table 2.2.

Primary information	Details we want to capture	Details we may want to ignore
Title	Main title; sub-title, if any	None
Author	First and last name	Full name; affiliations
Publication	Name of the publisher	Full address; Web address; contacts
Price	In local or regional currency	In more currencies
Edition	Number	None
Book Website	URL	None
Short description	Description in English	Description in other languages

Table 2.2 Capturing more information about the primary information

2.4.3 Transforming Information into an XML Format

At this stage, it is recommended that we transform our information into an XML format. This is relatively simple. We first need to identify the elements that we would like to see in our XML document, followed by the attributes. This is shown in Figure 2.32.

Figure 2.32 Process of transforming information into XML format

Let us discuss these points now.

Identifying elements

We know that elements are the backbone of an XML document. The contents in an original form are essentially transformed into elements whenever we think about XML. Without elements, there is barely anything else in XML. While identifying elements, we need to be sure that we are not losing crucial pieces of data and, simultaneously, that our XML document provides for future expansion. More specifically, we need to be sure that the design of elements does not in any manner constrain future additions. Again, this is similar in concept to identifying the possible columns in a database table. The idea there is also to capture the current information while providing for possible future expansion.

There can be many ways to systematically identify elements. However, the simplest mechanism could be a two-step process: (a) Write the hierarchy of elements as we would like to see them in the final XML document, and (b) Create an empty element structure similar to how it should look in the final XML document.

Figure 2.33 depicts the hierarchy of elements for our books example.

Figure 2.33 Hierarchy of elements for books example

Now it is fairly easy to convert the hierarchy of elements into an XML-like structure. All we need to do is to transform the visual form of the hierarchy into an XML-like syntax. The resulting structure is shown in Figure 2.34.

Figure 2.34 Transforming elements hierarchy in an XML format

Of course, this is not the only way in which we can capture or represent the information in an XML format. The same information can be organised in various other XML formats. This will depend entirely on the situation in terms of how we want the information to be accessible to the application programmers, as well as to the end users. The convenience, as well as the possible usages, of the information captured in the XML document will determine this representation.

For example, we can flatten the structure of our XML document by removing the intermediate level of hierarchy. The resulting XML document would be as shown in Figure 2.35.

Figure 2.35 Flattened XML structure for the books example

There are quite a few other ways in which the same information can be represented. This concept should be quite clear by now.

Exercise: Consider that we want to capture information about a salesperson. Capture this information in an XML format.

Solution: There would be many ways to do this. One possible format is as shown below.

<SALESPERSON_CONTACT>
   <BUSINESS_INFO>
      <WORK_TELEPHONE> </WORK_TELEPHONE>
      <CELL_PHONE> </CELL_PHONE>
      <WORK_EMAIL> </WORK_EMAIL>
   </BUSINESS_INFO>
   <PERSONAL_INFO>
      <HOME_TELEPHONE> </HOME_TELEPHONE>
      <PERSONAL_EMAIL> </PERSONAL_EMAIL>
   </PERSONAL_INFO>
</SALESPERSON_CONTACT>
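The two-step process described earlier (write down the hierarchy, then create an empty element structure from it) can also be carried out by a short program. The following is one possible sketch, assuming Python 3.9 or later and its standard xml.etree.ElementTree module. The nested dictionary and the function name build are our own illustration of the salesperson hierarchy.

```python
import xml.etree.ElementTree as ET

# Step (a): the hierarchy of elements, written as a nested dictionary.
# An empty dictionary means "no sub-elements".
hierarchy = {
    "SALESPERSON_CONTACT": {
        "BUSINESS_INFO": {"WORK_TELEPHONE": {}, "CELL_PHONE": {}, "WORK_EMAIL": {}},
        "PERSONAL_INFO": {"HOME_TELEPHONE": {}, "PERSONAL_EMAIL": {}},
    }
}


def build(name, children):
    """Step (b): create an empty element and, recursively, its sub-elements."""
    element = ET.Element(name)
    for child_name, grandchildren in children.items():
        element.append(build(child_name, grandchildren))
    return element


root_name, root_children = next(iter(hierarchy.items()))
root = build(root_name, root_children)

ET.indent(root)  # adds line breaks and indentation (Python 3.9 or later)
skeleton = ET.tostring(root, encoding="unicode", short_empty_elements=False)
print(skeleton)
```

Changing the design, for example flattening it by removing the intermediate level, then only requires a change to the dictionary.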

Identifying attributes

Primary pieces of information are captured as elements. However, this is not always sufficient. Many times, we want to add more specific details to these elements. There are two ways to handle this situation. We can either divide the elements into sub-elements, or add attributes to the main elements. As an example, let us suppose we want to add information about the category to which our book belongs. The resulting change is shown in Figure 2.36.

Figure 2.36 Adding category information to a book

How would we now add this information to our earlier XML structure? As mentioned earlier, there are two primary ways to do this. We can simply add an element called <BOOK_CATEGORY> to our XML document as shown in Figure 2.37.

Figure 2.37 Adding the category element to a book

As we can see, <CATEGORY> is added as a sub-element to our XML structure. However, there is another approach to handling this case. We can make the category an attribute of one of the suitable elements, for instance, that of the TITLE element. This approach is illustrated in Figure 2.38.

Figure 2.38 Adding the category attribute to a book

We will discuss more details about attributes, especially with relation to the design aspects, later.
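Until then, the following sketch shows how the two approaches differ for a program that reads the document. It assumes Python's standard xml.etree.ElementTree module. The attribute name category and the sample values are assumptions made for illustration; they are not taken from Figures 2.37 and 2.38.

```python
import xml.etree.ElementTree as ET

# Hypothetical book data; the category is stored in two different ways.
as_element = ET.fromstring(
    "<BOOK><TITLE>Sample Title</TITLE><CATEGORY>Technology</CATEGORY></BOOK>"
)
as_attribute = ET.fromstring(
    '<BOOK><TITLE category="Technology">Sample Title</TITLE></BOOK>'
)

# Sub-element approach: the category is the text of a child element of BOOK.
category_from_element = as_element.findtext("CATEGORY")

# Attribute approach: the category is a named value attached to the TITLE tag.
category_from_attribute = as_attribute.find("TITLE").get("category")

print(category_from_element, category_from_attribute)
```

Both designs carry the same information. The difference lies in where a reader of the document has to look for it.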

2.5 CREATING WELL-FORMED XML DOCUMENTS

In this section, we shall discuss a few important technical details about the syntax of XML. This will enable us to know how to create correct XML documents.

2.5.1 The <?xml> Tag and the Root Element

<?xml> tag

The <?xml> tag, formally called the XML declaration, identifies our document as an XML document. When it is present, it must always be the first statement in an XML document. As we know, it specifies the version of the XML specification that the document follows. For example, we write this tag as follows:

<?xml version="1.0"?>

There are a couple of attributes that we can specify along with the <?xml> tag. These attributes provide additional information about this tag. The standalone attribute specifies whether a given XML document is self-sufficient in all respects, or whether it has to depend on an external document (such as a DTD for validating the contents of the XML document). This is described below.

<?xml version="1.0" standalone="yes"?>

In this case, we specify that our XML document is self-sufficient. It means that the XML processor need not worry about bringing in an external file for any purpose. Such an XML document is called a standalone XML document.

<?xml version="1.0" standalone="no"?>

In this case, we want to specify that our XML document depends on an external reference (for instance, a DTD for validation).

One interesting point needs to be noted. Regardless of whether we specify the value of the standalone attribute as yes or no, the processing of the XML document remains quite unchanged! That is, the XML processor continues to process the XML document without any regard to this.

By default, the standalone attribute is considered to contain the value no. The usage of the yes value will be discussed when we cover DTDs.
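We can inspect what a parser actually reads from the <?xml> tag. The following is a small sketch, assuming Python's standard xml.parsers.expat module, whose XmlDeclHandler callback receives the version, the encoding, and the standalone setting. The function name read_declaration and the sample documents are our own.

```python
import xml.parsers.expat


def read_declaration(text):
    """Return (version, encoding, standalone) as reported by the parser,
    or None if the document has no <?xml> tag."""
    seen = {}

    def on_declaration(version, encoding, standalone):
        # standalone is 1 for "yes", 0 for "no", and -1 if it is not given.
        seen["declaration"] = (version, encoding, standalone)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(text, True)
    return seen.get("declaration")


with_yes = read_declaration('<?xml version="1.0" standalone="yes"?><BOOKS/>')
with_no = read_declaration('<?xml version="1.0" standalone="no"?><BOOKS/>')
omitted = read_declaration('<?xml version="1.0"?><BOOKS/>')
no_tag = read_declaration("<BOOKS/>")

print(with_yes, with_no, omitted, no_tag)
```

In all four cases the document itself is parsed in the same manner, which matches the observation above that the value of standalone does not change the basic processing.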

The other attribute of significance is that of the character encoding.

Character encoding allows us to specify the character set, based on the ISO standards or Unicode standards, which is used to create the markup and the contents of the document. It is written as the encoding attribute of the <?xml> tag.

We can specify this attribute even in the case of HTML files. The Internet Assigned Numbers Authority (IANA) governs the possible values of this attribute. Some of the character encodings are listed in Table 2.3.

Character encoding	Purpose	Language
US-ASCII	7-bit ASCII, covers the 128 characters (codes 0 to 127) of the ASCII code	English
UTF-8	This is a variable-length encoding of Unicode. It uses just one byte if the character is one of the 128 ASCII characters. Otherwise, it uses two to four bytes for a character.	Compressed Unicode
UTF-16	This is a variable-length encoding of the Universal Character Set (UCS). It uses two bytes for most characters and four bytes for the rest.	Compressed UCS
ISO-8859-X (X is a number between 1 and 16)	Represents ASCII plus one more group of languages, for example, (a) ISO-8859-1 covers ASCII and most of the western European languages, (b) ISO-8859-6 is ASCII and Arabic, etc.	Various
ISO-2022-JP	Japanese characters	Japanese
KOI8-R	Russian alphabet	Russian
ISO-2022-KR	Korean characters	Korean

Table 2.3 Character encoding values in XML (Incomplete)
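The number of bytes that an encoding needs for a character can be checked directly. The following sketch assumes Python and its standard xml.etree.ElementTree module. The four sample characters and the NAME element are chosen by us for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample characters, from plain ASCII to a less common symbol.
samples = {
    "letter A": "A",
    "e with acute accent": "\u00e9",
    "euro sign": "\u20ac",
    "musical G clef": "\U0001d11e",
}

byte_counts = {
    name: (len(ch.encode("utf-8")), len(ch.encode("utf-16-be")))
    for name, ch in samples.items()
}
for name, (utf8_bytes, utf16_bytes) in byte_counts.items():
    print(name, "UTF-8:", utf8_bytes, "UTF-16:", utf16_bytes)

# A document saved in ISO-8859-1 must say so in its <?xml> tag, so that
# the parser can turn the stored bytes back into the right characters.
stored_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><NAME>Caf\u00e9</NAME>'
).encode("iso-8859-1")
decoded_name = ET.fromstring(stored_bytes).text
print(decoded_name)
```

UTF-8 needs between one and four bytes per character here, whereas UTF-16 needs two bytes for the first three characters and four for the last one.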

The root element

We have mentioned earlier that every XML document must have exactly one root element. All the other contents of the XML document are a sub-part of this root element. Therefore, we can show the structure of an XML document with reference to the root element as depicted in Figure 2.39.

Figure 2.39 Root element concept

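Note that the root element must always be the first element in the document. Only items such as comments, processing instructions, and a document type declaration may come between the <?xml> tag and the root element.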

2.5.2 Opening and Closing of Tags

We need to remember the following:

Every element has an opening tag and a matching closing tag. The only exception is an empty element, which may be written as a single tag, as discussed in the next section.

The opening and closing tags signify the position of an element in an XML document. For example, if the closing tag of an element immediately follows the opening tag, it means that this element does not, in turn, contain sub-elements. On the contrary, if we have more elements inside the opening and closing tags of an element, it means that this element contains one or more sub-elements. This concept is illustrated in Figure 2.40.

Figure 2.40 Elements and sub-elements

We also need to reiterate that an element opening tag is of the form <NAME>, whereas an element closing tag looks like </NAME>.

2.5.3 Empty Elements

There are two ways in which an empty element in an XML document can be depicted. We can use the tag pair <NAME> and </NAME>, that is, the opening and closing tags containing the element name, without any content in between. Alternatively, we can do away with the separate closing tag and write a single tag of the form <NAME/>, in which the slash comes just before the closing angle bracket. This is shown in Figure 2.41.

Figure 2.41 Two ways of representing an empty element in XML

Let us explain this with a simple exercise.

Exercise 1: Show a customer name including the first and the last name, but the middle name should be empty. Use the opening and closing tag pair for the empty element.

Solution 1:

<NAME>
<MIDDLE></MIDDLE>
<LAST>Kahate</LAST>
</NAME>

Exercise 2: Now show the same information by using just the single <MIDDLE/> tag for the empty middle name.

Solution 2:

<NAME>
<MIDDLE/>
<LAST>Kahate</LAST>
</NAME>
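To convince ourselves that the two notations for an empty element mean the same thing, we can parse both and compare the results. This is only a sketch: the first name used here is a made-up value for illustration, and middle_name is a helper of our own.

```python
import xml.etree.ElementTree as ET

# "Atul" is a hypothetical first name, used only to complete the example.
pair_form = "<NAME><FIRST>Atul</FIRST><MIDDLE></MIDDLE><LAST>Kahate</LAST></NAME>"
short_form = "<NAME><FIRST>Atul</FIRST><MIDDLE/><LAST>Kahate</LAST></NAME>"

def middle_name(text):
    """Parse the document and return its MIDDLE element."""
    return ET.fromstring(text).find("MIDDLE")

for doc in (pair_form, short_form):
    middle = middle_name(doc)
    # An empty element has no text and no children, whichever way it is written.
    print(middle.tag, repr(middle.text), len(middle))
```

Both iterations print the same line, which is why a parser treats the two forms as interchangeable.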

2.5.4 Entities

An entity in XML represents a text that you want to use repeatedly without having to write it every time. Instead, you define it at one place, and refer to it from other places. It provides a convenience in terms of reducing the effort of writing it every time. For example, an organisation usually has a logo and a mission statement. Rather than specifying these on every Web page on the website of the organisation, we can create them only once and refer to them from every Web page.

There are three types of entities in XML: character, text, and binary. This is shown in Figure 2.42.

Figure 2.42 Types of entities in XML

We will now discuss these types.

2.5.4.1 Character entities

Those familiar with programming languages such as C, C++, C#, and Java should be familiar with the concept of escape sequences. Put simply, an escape sequence lets us give a reserved character, or a reserved piece of text, a meaning different from its usual one. A similar concept exists in XML in the form of entities.

Character entity references are special character codes that assign a different meaning to a special symbol.

Character entities allow us to replace a pre-defined (that is, fixed) text with something else.

For example, we know that an element name is embedded inside the symbols < and >. For example, <NAME> is an element. Thus, in XML, there is a specific meaning associated with the symbols < and >. They indicate the start and end of an XML tag. But then, what if we want to use these symbols as values? That is, what if we do not want the XML processor to consider < as the start of a tag, but to actually consider it to mean less than? In all such cases, we need entity references.

In XML, there are five character entity references, as shown in Table 2.4.

Character entity	Meaning
&amp;	Represents the ampersand (&) character
&apos;	Represents the apostrophe (') character
&gt;	Represents the greater than (>) character
&lt;	Represents the less than (<) character
&quot;	Represents the quotation mark (") character

Table 2.4 Character entity references

We need to note the following points about the character entities.

The syntax for writing character entities is always &, followed by the name of the entity, and terminated with a semi-colon. This idea is depicted in Figure 2.43.

Figure 2.43 Character entity syntax

An example of using an entity is shown in Figure 2.44.

Figure 2.44 Example of using a character entity
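The round trip between special characters and their character entities can be sketched in Python. The escape function from the standard xml.sax.saxutils module writes the entity references for us, and the parser converts them back when reading. The element name CONDITION and the sample text are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Writing: escape() replaces &, < and > with their entity references.
raw_text = "if a < b & b > c"
escaped = escape(raw_text)
print(escaped)  # if a &lt; b &amp; b &gt; c

# Reading: the parser turns the references back into the plain characters.
element = ET.fromstring("<CONDITION>" + escaped + "</CONDITION>")
print(element.text)  # if a < b & b > c
```

Without the escaping step, the parser would treat the < inside the text as the start of a new tag and report an error.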

2.5.4.2 Text entities

Text entities take the concept of character entities further. They allow us to define a (possibly large) piece of data as an entity. In the actual XML document, we then use this entity name, rather than the original text.

Usually, text entities are used to associate large or repeated blocks of text with a name and replace the text with the entity name.

The syntax for declaring a text entity is as shown in Figure 2.45.

Figure 2.45 Text entity declaration syntax

An example of declaring a text entity is as shown in Figure 2.46.

Figure 2.46 Text entity declaration example

Clearly, whenever we use the entity country later in the XML document, we will see the entity being replaced with its value, that is, India. We will illustrate this later when we study DTDs. For now, the result based on our conceptual understanding is as shown in Figure 2.47.

Figure 2.47 Entity example
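Although we have not studied DTDs yet, a brief sketch can show the replacement in action. Here the entity country is declared with the value India inside a small document type declaration, and the parser substitutes the value wherever &country; appears. The element names ADDRESS, CITY, and COUNTRY and the city value are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

document = """<!DOCTYPE ADDRESS [
  <!ENTITY country "India">
]>
<ADDRESS>
  <CITY>Pune</CITY>
  <COUNTRY>&country;</COUNTRY>
</ADDRESS>"""

root = ET.fromstring(document)

# The parser has already replaced &country; with the declared text.
print(root.find("COUNTRY").text)  # India
```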

2.5.4.3 Binary entities

Binary entities are used to associate a name with binary data (such as an image or a video) and use the entity name instead of the actual binary data.

Binary entities contain data that is of type other than text and markup. Importantly, because binary entities can define many forms and types of data (for example, image, audio, video), we must associate this type with the entity name. For example, if a binary entity is being defined in the place of an image, we must declare that whenever we actually want to replace this binary entity name with an image, the XML processor must try and interpret the contents of the image as an image (and not as audio or video, for instance). Otherwise, the XML processor may not know how to interpret the contents of a binary entity.

Figure 2.48 shows an example.

Figure 2.48 Binary entity example

The explanation of this binary entity declaration is as follows:

Keyword SYSTEM indicates that the contents of the entity are held in an external resource, which is named immediately after the keyword. We will revisit it when we discuss DTDs. The remaining portion of the declaration states that CITY is the name of the entity being declared. It is binary because of the usage of the keyword NDATA, which indicates that this is “not XML data”. The actual contents of this entity are defined in a separate file, called delhi.html. That is, wherever a reference is made to the entity named CITY in our XML document, we want the contents of this HTML file to be used in place of the entity. Finally, html indicates that the type of this entity is HTML. In other words, we want the Web browser to handle this file. The XML processor does not interpret such contents itself. Instead, it passes the file name and the declared type to the application, which can then, for example, start the default Web browser and open the delhi.html file inside it. This explanation is shown diagrammatically in Figure 2.49.

Figure 2.49 Binary entity example explained
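The following sketch shows what an XML parser actually reports for a binary entity. The declaration syntax is our reconstruction from the explanation above (entity CITY, file delhi.html, type html) and may differ in detail from the figure; the root element GUIDE and the notation's system identifier "browser" are assumptions. We use the EntityDeclHandler callback of Python's standard xml.parsers.expat module.

```python
from xml.parsers import expat

# Assumed declaration: CITY is an external, non-XML (NDATA) entity of type html.
document = """<!DOCTYPE GUIDE [
  <!NOTATION html SYSTEM "browser">
  <!ENTITY CITY SYSTEM "delhi.html" NDATA html>
]>
<GUIDE/>"""

declared = {}

def on_entity_decl(name, is_parameter, value, base, system_id, public_id, notation):
    # For an NDATA entity, value is None; the parser only records where the
    # data lives (system_id) and what type it was declared to be (notation).
    declared[name] = (system_id, notation)

parser = expat.ParserCreate()
parser.EntityDeclHandler = on_entity_decl
parser.Parse(document, True)

print(declared)  # {'CITY': ('delhi.html', 'html')}
```

Note that the parser never opens delhi.html. It merely hands the file name and the type to the application, which decides what to do with them.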

We will discuss entities further in the context of DTDs.

2.5.5 Element Naming and Nesting Conventions

Naming elements

There are certain rules and conventions that we need to follow while defining elements. These rules are summarised in Figure 2.50.

Figure 2.50 Element naming rules

These are some of the allowed element names:

<Name05>
<NAME05>
<Name.05>
<_05Name>
<name05>
<name_05>

The following are invalid element names, for the stated reasons.

<Project=05> Contains an equal to (=) sign
<PROJECT:05> Contains a colon (:) not at the beginning
<Project 05> Contains a space
<Project%05> Contains a percent (%) sign
<05project> Starts with a number
<.project.05> Starts with a period (.)

XML element names are case-sensitive. Thus, <NAME>, <Name>, and <name> are treated as three different elements. The rules of opening and closing tags must be followed precisely. For example, in Figure 2.51 below, case (a) is valid, but case (b) is not. Note that the opening and closing tags do not exactly match in case (b).

Figure 2.51 Opening and closing tags: Element naming conventions

Let us have an exercise to test our understanding of the element naming conventions.

Exercise: Identify which of the following element names are valid and which ones are not. Give reasons for invalid element names.

<empName>	Valid
<emp_name>	Valid
<emp*name>	Invalid: Cannot contain an asterisk (*)
<EMPNAME>	Valid
<EmpName>	Valid
<Emp.Name>	Valid
<.EmpName>	Invalid: Cannot begin with a dot (.)
<_EmpName>	Valid
<-EmpName>	Invalid: Cannot begin with a hyphen (-)
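The naming rules used in this exercise can be approximated with a regular expression. The sketch below is deliberately simplified: it handles only ASCII letters, whereas the full XML rules also admit many non-English letters, and the names NAME_PATTERN and is_valid_name are our own.

```python
import re

# Simplified, ASCII-only version of the naming rules in this section:
# the first character is a letter or an underscore; the remaining characters
# may be letters, digits, periods, hyphens, or underscores.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

def is_valid_name(name):
    """Return True if name satisfies the simplified rules above."""
    return NAME_PATTERN.fullmatch(name) is not None

candidates = ["empName", "emp_name", "emp*name", "Emp.Name",
              ".EmpName", "_EmpName", "-EmpName", "05project", "Project 05"]

for candidate in candidates:
    print(candidate, is_valid_name(candidate))
```

Running it reproduces the verdicts given in the exercise: names containing an asterisk or a space, or starting with a period, a hyphen, or a digit, are rejected.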

Nesting elements

In XML, child elements must be nested completely inside the parent element.

HTML also allows an element to contain one or more child elements (or sub-elements). However, the nesting of the elements is not strictly followed in HTML. That is, the code piece shown in Figure 2.52 may be allowed in HTML in certain cases. However, XML is strict about this, and does not allow such an overlap of nesting.

Figure 2.52 Overlapped nesting of HTML elements

XML is strict about syntax checking. Hence, unlike HTML, it does not allow elements and sub-elements to overlap. This makes sense, because the whole focus of XML is on the contents or representation of data elements, and not on their display, and overlapping tags would leave the structure of the data ambiguous. Of course, this is not to suggest that overlapping is good practice in HTML either. It is merely tolerated there in some cases, whereas in XML it is always an error.

Exercise: Correct the following XML by fixing the wrong overlapping in the nesting of elements.

<ORGANISATION>
<LOCATION>
<DEPARTMENT>
<NAME> Admin </NAME>
<NAME> Sales </NAME>
</LOCATION>
</DEPARTMENT>
</ORGANISATION>

Solution: We can see that there is an incorrect overlapping of the <LOCATION> and <DEPARTMENT> elements. The closing tag </DEPARTMENT> should appear before the closing tag </LOCATION>. Currently, they are in the wrong sequence. Hence, the corrected XML contents would look as follows:

<ORGANISATION>
<LOCATION>
<DEPARTMENT>
<NAME> Admin </NAME>
<NAME> Sales </NAME>
</DEPARTMENT>
</LOCATION>
</ORGANISATION>
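An XML parser enforces the nesting rule for us. In the sketch below, a shortened version of the exercise is parsed twice, once with the overlapping closing tags and once with the corrected order. The helper parse_error is our own and simply returns the parser's complaint, if any.

```python
import xml.etree.ElementTree as ET

# </LOCATION> closes before </DEPARTMENT>: the elements overlap.
overlapping = ("<LOCATION><DEPARTMENT><NAME>Admin</NAME>"
               "</LOCATION></DEPARTMENT>")

# The inner element is closed first: proper nesting.
corrected = ("<LOCATION><DEPARTMENT><NAME>Admin</NAME>"
             "</DEPARTMENT></LOCATION>")

def parse_error(text):
    """Return the parser's error message, or None if text is well formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        return str(error)
    return None

print(parse_error(overlapping))  # a message reporting a mismatched tag
print(parse_error(corrected))    # None
```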

2.5.6 Adding Attributes

We have discussed the idea of attributes earlier. Attributes allow us to specify more information about XML elements. They are optional, but when used carefully, can add a lot of intelligence to an XML document.

Attributes are widely used in HTML. Here, their main purpose is to contain formatting instructions. That is, in addition to the basic HTML tags, attributes allow us to provide more details about those basic tags. For instance, the HTML <TABLE> tag allows us to create a table. We can specify many more details about such a table in the form of attributes. For example, we can specify the border, colour and background of the table in the form of attributes. This concept is shown in Figure 2.53.

Figure 2.53 Concept of attributes in HTML

Several such examples can be given.

Like HTML, XML also supports the concept of attributes. However, we need to remember the following.

Attributes in HTML are used to control the display or formatting characteristics. Attributes in XML are used for providing more information about certain elements.

Before we discuss anything more about XML attributes, we also need to remember an important basic concept:

Attributes merely provide an alternative to sub-elements.

In other words, we can use attributes in the place of sub-elements, and vice-versa, in many situations. An obvious question then is how do we decide between the two? What are the parameters that help us decide between representing something as a sub-element, and not as an attribute (or vice-versa)? There are certain points that we need to think about, while taking this decision. These observations are merely a result of the possible usages of XML in different situations, and would largely apply. But in a specific situation, they may not prove to be ideal.

Are we trying to represent main data or metadata?
Do we need to extend this further?
Do we need to worry about the structure of this data?
Is there any preference about the display characteristics?

Let us try to find out answers to these questions.

1. Are we trying to represent main data or metadata?

There is always some confusion about what should be called main data and what should be termed metadata. The general definition states that information is main data, and information about information is metadata. Based on this, the general design concept is that all main data should be represented in XML as elements, and all metadata should be modelled as attributes.

This brings us to an interesting question: what is main data and metadata in a given context? Well, it depends! Main data in one situation could become metadata in another, and vice-versa! For example, consider the XML element shown in Figure 2.54, which contains an attribute.

Figure 2.54 Main data or metadata?

As we can see, information about a team (India) is provided in the form of an element. However, information about this information (that is, the name and details of the captain) is provided as an attribute. We cannot say for sure whether this additional information about the team (that is, the metadata in the form of the captain attribute) should always be represented like the way it is shown. It would clearly depend on the context. If we are likely to have a situation where we need to find out more information about the captain, we should model it as a sub-element, instead of as an attribute. Figure 2.55 shows this.

Figure 2.55 Changing an attribute to a sub-element

Clearly, this decision depends on a given situation. We cannot apply it in all the situations, but need to think about it in the context of a given frame of reference.
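The practical difference between the two choices becomes visible when a program reads the data. In the sketch below, the element names TEAM, NAME, CAPTAIN, and MATCHES, the attribute name captain, and all the values are made up for illustration and are not taken from the figures.

```python
import xml.etree.ElementTree as ET

# Captain modelled as an attribute: it can hold only a single string.
as_attribute = '<TEAM captain="Rahul Dravid">India</TEAM>'

# Captain modelled as a sub-element: it can be extended with more details.
as_sub_element = """<TEAM>
  <NAME>India</NAME>
  <CAPTAIN>
    <NAME>Rahul Dravid</NAME>
    <MATCHES>25</MATCHES>
  </CAPTAIN>
</TEAM>"""

attr_team = ET.fromstring(as_attribute)
print(attr_team.text, attr_team.get("captain"))

elem_team = ET.fromstring(as_sub_element)
captain = elem_team.find("CAPTAIN")
print(elem_team.findtext("NAME"), captain.findtext("NAME"),
      captain.findtext("MATCHES"))
```

If we later need to record more about the captain, the second form lets us add further sub-elements without disturbing existing ones, whereas the attribute form would have to be redesigned.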

2.6 XML NAMESPACES

Imagine a situation where we are informed that we are going to be visited by a person known to us named Sachin. We would obviously be curious to know who that person is. The simple reason behind this curiosity is some sort of ambiguity. This is because we may be aware of many persons with the same name, Sachin. Therefore, we seek additional information. This is usually in the form of surname. Once we know that this Sachin is Tendulkar, we are clear about which Sachin we are talking about. Thus, the ambiguity is dealt with. We want to discuss a similar situation now.

In the context of XML, we have addressed the problems we encounter in the absence of namespaces. For example, if we do not have context information available, we would not understand the meaning of an XML element named as TITLE. For example, we would not understand if this is the title of a book, movie, person, or something else! Therefore, we need to add more information to this element in terms of qualifying it further. For example, we would need to specify it as BOOK:TITLE, MOVIE:TITLE, or PERSON:TITLE. Now, the ambiguity is gone.

XML namespaces provide us more information about elements in the form of qualifiers, thus removing ambiguities, if any.

We must not confuse the term namespace as used in XML with the same term used in programming languages, such as C++. These have completely different meanings.

Let us understand the usage of namespaces with an actual example. Consider a books library, which maintains information about a book being issued to a member. A sample book issue record in the form of an XML document fragment is illustrated in Figure 2.56.

Figure 2.56 XML document for keeping issued books information in a library

As we can see, three sub-elements repeat in both BOOK_INFO and MEMBER_INFO elements. These three sub-elements are NUMBER, TITLE, and NAME. When we look at the complete XML document, there is no problem in terms of understanding which NUMBER, TITLE, or NAME we are referring to. However, if we are just given an element, without providing information about the whole XML document (which we are calling the context), we would clearly be confused. For example, would a NUMBER refer to the book number or the member number? We simply would not know!

This is where namespaces come to the rescue. They allow us to unambiguously define elements, sub-elements, etc. Using namespaces, we can alter our XML document in such a fashion that we add a prefix to the elements in question. This is shown in Figure 2.57.

Figure 2.57 Use of namespaces

The prefix that we add to an element to make it unambiguous is called a namespace prefix.

A namespace prefix is completely user-defined. That is, it does not have to follow specific rules, apart from the basic XML naming rules. Thus, we can use a namespace prefix such as BOOK, BK, etc.

A namespace prefix needs to be tied to a Uniform Resource Identifier (URI). The URI has a name. We are free to use a URI name of our choice.

Let us modify our XML document to add namespace URI and namespace prefixes. This will clarify the terms better. This is depicted in Figure 2.58.

Figure 2.58 Adding namespace prefix and namespace URI

Let us first discuss the namespace URI. The following lines declare the namespace URI:

<BOOK_ISSUE xmlns:book = "urn:bookDetails"
xmlns:member = "urn:memberDetails">

Let us analyse this.

BOOK_ISSUE is the root element of our XML document.
To this root element, we have added two namespace prefixes: book and member. In effect, we inform the XML parser that it may see elements and attributes in our XML document prefixed by book or member.
The declarations urn:bookDetails and urn:memberDetails associate each namespace prefix with a longer name, called the namespace URI. The URI does not appear on the actual elements in the XML document. Remember, they carry only the namespace prefix. To the XML parser, however, it is the URI that identifies the namespace, and the prefix is merely a shorthand for it.

This is illustrated in Figure 2.59.

Figure 2.59 Usage of namespace prefix illustrated with an example

As we can see, the namespace prefix is defined in the root element, and used inside the elements describing data. Thus, there is no confusion about an element called NUMBER now, for example. We clearly have book:NUMBER or member:NUMBER.
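We can observe how a parser treats prefixes and URIs with a short sketch. It uses the two namespace declarations discussed above; the NUMBER values are made up for illustration. Python's xml.etree.ElementTree reports every qualified name in the form {namespace URI}local-name.

```python
import xml.etree.ElementTree as ET

document = """<BOOK_ISSUE xmlns:book="urn:bookDetails"
            xmlns:member="urn:memberDetails">
  <book:NUMBER>B-101</book:NUMBER>
  <member:NUMBER>M-42</member:NUMBER>
</BOOK_ISSUE>"""

root = ET.fromstring(document)
for child in root:
    # The prefix has been replaced by the namespace URI it stands for.
    print(child.tag, child.text)

# For lookups we supply our own prefix-to-URI map. The prefixes b and m are
# local to this program and need not match those used in the document.
prefixes = {"b": "urn:bookDetails", "m": "urn:memberDetails"}
print(root.findtext("b:NUMBER", namespaces=prefixes))  # B-101
print(root.findtext("m:NUMBER", namespaces=prefixes))  # M-42
```

The two NUMBER elements are now clearly distinct, because the parser identifies each by its URI and not by the prefix.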

Many times, instead of providing a name to the namespace URI, a Uniform Resource Locator (URL) is used. As we know, a URL points to a unique file on the World Wide Web (WWW). For example, www.yahoo.com/index.html is an example of a URL. This URL points to the index.html file on Yahoo's website.

We can specify any URL. It need not point to an actual resource, because the XML parser never tries to retrieve it. The URL serves only as a unique name. Therefore, we can specify our URIs in the form of URLs as shown below.

<BOOK_ISSUE xmlns:book = "http://www.dummysite.com/myURL-book"
xmlns:member = "http://www.dummysite.com/myURL-member">

To reduce the number of places where the namespace prefix needs to be used, we can specify a default namespace. A default namespace can be specified for an element and all its sub-elements. For this, we need to specify a namespace without a prefix (that is, just use the keyword xmlns without a prefix). For example, consider the following declaration.

<BOOK_ISSUE xmlns = "http://www.dummysite.com/myURL-book"
xmlns:member = "http://www.dummysite.com/myURL-member">

This means that all the elements (and their sub-elements at any level) that do not have any namespace prefix attached in the XML document will automatically belong to the default namespace. Attributes without a prefix are not affected by it. Clearly, in this case, we want to use the default namespace for the book element and all its sub-elements.

Thus, our modified XML document will look as shown in Figure 2.60.

Figure 2.60 Use of default namespace
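The effect of a default namespace can be verified with a parser as well. The sketch below reuses the dummy URLs from the declaration above, written with a plain hyphen; the NUMBER values are made up for illustration.

```python
import xml.etree.ElementTree as ET

BOOK_URI = "http://www.dummysite.com/myURL-book"
MEMBER_URI = "http://www.dummysite.com/myURL-member"

document = f"""<BOOK_ISSUE xmlns="{BOOK_URI}"
            xmlns:member="{MEMBER_URI}">
  <NUMBER>B-101</NUMBER>
  <member:NUMBER>M-42</member:NUMBER>
</BOOK_ISSUE>"""

root = ET.fromstring(document)

# Unprefixed elements, including the root, fall into the default namespace.
print(root.tag)
tags = [child.tag for child in root]
print(tags)
```

The unprefixed NUMBER element is reported under the book URI, while member:NUMBER keeps the member URI. The parser does not contact www.dummysite.com at any point.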

KEY TERMS AND CONCEPTS

Binary entity
Character entity
Character encoding
Entity
Escape sequence
Generalised Markup Language (GML)
Inverted tree
Markup
Markup language
Namespace
Namespace prefix
Standard Generalised Markup Language (SGML)
Text entity
Uniform Resource Identifier (URI)
Uniform Resource Locator (URL)

CHAPTER SUMMARY

XML documents can be created by using any text editor and can be viewed with the help of any Web browser.
When the XML document is viewed in the browser, all the elements are displayed in the same manner and the hierarchy of the elements is preserved.
XML tags specify how the XML document can be broken down into parts and subparts. These tags are also known as markup, and hence XML is known as a markup language.
XML is used for classifying text for information storage, classification, and retrieval purposes. It is written in such a manner that it can be extended easily, depending on the business domain, particular sets of requirements, or technology. Therefore, it has its name Extensible Markup Language.
HTML and XML are based on the Standard Generalised Markup Language (SGML). It is not used to create the contents but to create other languages.
SGML was created to provide a means for identifying the portions and content of a document, not by line numbers or the actual content, but by the type of the content.
SGML is based on yet another language called Generalised Markup Language (GML). GML was developed at IBM in the 1960s.
An XML document is similar to an inverted tree.
XML elements can contain data, other sub-elements, or nothing. But XML does not mandate that we use specific layering or organising strategy for elements. The way we layer the data depends on our requirements.
The comments are used to provide information about the document elements and can also be used to temporarily hide or conceptually disable the portions of an XML document that are not in use currently.
The comments cannot appear before the xml tag, cannot break the element, cannot come inside an XML element declaration, and cannot be nested.
Ordering or sequencing of elements has no relevance in XML documents.
The process of transforming the information in an XML format consists of identifying the elements and attributes.
The way the information is represented in the XML document depends on how we want the information to be accessible to the application programmers as well as to the end users.
The <?xml> tag identifies the document as an XML document. This must always be the first statement in an XML document.
Character encoding allows us to specify the language based on the ISO standards or Unicode standards, which is used to create the markup and the contents of the document.
Every XML document must have exactly one root element. All the other contents of the XML document are a sub-part of this root element.
Every element having value or sub-elements must have an opening and a closing tag.
Empty elements can have an opening and closing tag pair or can have only one tag <elementname/>.
An entity in XML represents a text that we want to use repeatedly without having to write it every time. They can be of three types, namely, character, binary and text.
Character entities are special character codes that assign a different meaning to a special symbol.
Text entities are used to associate large or repeated blocks of text with a name and replace the text with the entity name.
Binary entities are used to associate a name with binary data (such as an image or a video) and use the entity name instead of the actual binary data.
Element names in XML should contain at least one letter (a-z or A-Z), should start with a letter or an underscore, and can contain letters, digits, hyphens, underscores, and full stops.
The XML element names are case sensitive. The child elements must be nested completely inside the parent element.
Attributes in HTML are used to control the display or formatting characteristics. In XML, they are used for providing more information about certain elements. The attributes in XML merely provide an alternative to sub-elements.
XML namespaces provide us more information about elements in the form of qualifiers, thus removing ambiguities, if any.
The prefix that we add to an element to make it unambiguous is called a namespace prefix.
The namespace prefix is defined in the root element, and is used inside the elements to describe data.

PRACTICE SET

True or False Questions

The XML file should always have '.xml' as extension.
The hyphen symbol shown to the left of all the start element tags, when viewed in the XML documents helps to expand or compress the element details.
If we enclose the details of the element BOOK in the XML document between tags <BOOK> and </Book> there will be no error when viewed in Web browser.
There is one and only one way to represent the same XML content in different documents.
Every element in the XML should have either a sub-element or some value.
The xml tag must be the first statement for any XML document to be valid.
The keyword NDATA, when used with Entity tag indicates that this is “not XML data”.
<-StudentName> is an invalid element name.
The attributes and sub-elements cannot be used interchangeably in the XML document.
The keyword xmlns without a prefix is used to define the default namespace.

Multiple Choice Questions

The ________ symbol on the left of the element indicates that the information for this symbol is hidden.
1. *
2. –
3. ?
4. +
The comments in the XML document are enclosed between the________.
1. </-- and --/> tags
2. </-- and -- > tags
3. <? And ?> tags
4. <CS_ and _CE>tags
The comments should be enclosed inside the tag boundaries________.
1. </-- and -- >
2. <? And ?>
3. <// and //>
4. <!-- and -->
The version that the XML document is following is stated by the________.
1. Version tag
2. Standalone tag
3. Xml tag
4. None of above
The empty element can be written as ________.
1. <ElementName><ElementName>
2. <ElementName></ElementName>
3. <ElementName/>
4. Both 2 and 3
The syntax for writing character entities is ________.
1. < Entity name;
2. ? Entity name;
3. $ Entity name;
4. &Entity name;
The valid syntax for declaring the text entity PM with value Manmohan is ________.
1. <!ENTITY PM "Manmohan">
2. <? ENTITY PM "Manmohan">
3. </ ENTITY PM "Manmohan"/>
4. < ENTITY PM "Manmohan">
The ________ represents the quotation mark (") character.
1. &quote
2. &quot;
3. &quot
4. &quote
The namespace prefix is defined inside the ________.
1. xml tag
2. Root element
3. Anywhere in 1 or 2
4. Between xml tag and root element
The XML element name can contain special characters ________, ________ and ________.
1. hyphens, underscores, full stops
2. hyphens, colon, full stop
3. full stop, colon, semicolon
4. ampersand, colon, hyphen

Detailed Questions

Write a note on the concept of markup.
Write a short note on how XML organises data.
Explain with an example the steps to be followed in transforming the information from any format into XML format.
Write a short note on the xml tag and its attributes.
What are character entities? What are they used for?
What are the element naming and nesting conventions?
What are the parameters that help us decide between representing something as a sub-element, and not as an attribute (or vice-versa)?
Define and explain the term binary entity.
What are attributes? How are they different in HTML and XML?
What are XML namespaces? How are they used to remove ambiguity in element names?

Exercises

Think about the concept of markup. In which situations have you encountered it?
Try to figure out why XML uses the readable markup to delimit text. Why does it not use a binary format, for example?
Take the example of a shopping cart. Create an XML file to depict the various aspects of the shopping cart.
Would you model everything in a shopping cart as elements, or as attributes? Why?
Is an attribute an alternative to a sub-element in all situations? Why?
Is the concept of entities unique to XML? Do programming languages (e.g., C, C++, C#, Java) have something similar? Investigate.
Where would you use HTML and where would you think about XML?
Can HTML and XML be interchanged? Why?
The namespace syntax can be quite intimidating to start with. Try reading the specifications of the namespace syntax.
Think about a situation where your data will originate as XML and be presented to the user in the final form as HTML. Who will do this conversion? Do we have to do this manually?

ANSWERS TO EXERCISES