History of XML

The first thing to understand about the Extensible Markup Language (XML) is that the name is a misnomer. XML is not itself a markup language; it is a set of rules for creating new markup languages. So, let's begin with a definition of XML and then deconstruct the definition into its parts:

XML is a subset of the Standard Generalized Markup Language (SGML) that specifies the rules for creating markup languages, such as the Hypertext Markup Language (HTML), that can be shared on the World Wide Web (WWW).

Because XML is a subset of SGML and every valid XML document is also a valid SGML document, let's first examine SGML. SGML was a standardization of the Generalized Markup Language (GML) invented by Dr. Charles Goldfarb, Ed Mosher, and Ray Lorie in 1969 while working for IBM. It is interesting to note that the initials GML match the first letters of the three inventors' last names.

The idea behind GML was to formalize the markup of text documents that was already occurring in the publishing industry. When a publisher received raw text, a copy editor marked up the document with instructions to the typesetter on layout, fonts, indentation, and spacing. These handwritten instructions are traditionally called markup. Generic markup separates the logical structure of a document from its content. After you have the structure of a document, you can attach a stylesheet to it that specifies formatting instructions for each structural element.

In 1974, Dr. Goldfarb produced the first validating parser for GML. Between 1978 and 1986, Dr. Goldfarb acted as technical lead of the team that transitioned GML into the ISO 8879 standard, which they called SGML.

As stated earlier, markup allows you to separate the structure of a document from its content.

The structure denotes the purpose of the document's data. In computer science, data that describes other data is called meta-data. For example, if I circle the first sentence in a paragraph and label it as a title, that label is meta-data about the original data (the circled sentence). Both the label "title" and the sentence are data; however, the label is information intended to clarify or modify the information it is applied to. Meta-data, then, is data that sits outside the original data for the purpose of enhancing or clarifying the original. SGML is a formalized method for capturing the meta-data for a document by using markup on the content. Table 1.1 is an example of data and meta-data.

Table 1.1. Data and Meta-data
Data                    Meta-data
Michael Daconta         Name
4296 Razor Hill Road    Street
Bealeton                City
VA                      State
22712                   Zip Code

So, SGML defines the rules for creating markup. A markup language is an application of SGML (or the rules thereof). A markup language is composed of a set of markup tags (words that describe purpose) and a specific tree structure to describe one type of document. The legal tags and tree description (also called a content model) reside in a file called a document type definition (DTD). The best-known example of a markup language is HTML. Let's discuss a brief history of HTML to understand how it relates to SGML and how it influenced XML.

In 1989, Tim Berners-Lee and Anders Berglund developed a markup language for hyperlinked text documents.[1] It is interesting to note that they started from one of the first published SGML document type definitions from an IBM manual written by Dr. Goldfarb. Their markup language was called the Hypertext Markup Language (HTML). HTML is a language that describes hypertext documents. A hypertext document is a document with a head and body whose body can contain text, lists, links to other documents, images, forms, frames, and other components.

[1] Berners-Lee, Tim, "Information Management: A Proposal," March 1989, May 1990.

HTML was an application of SGML. The key components of HTML are start tags, end tags, elements, attributes, comments, and entity references. Listing 1.1 is a simple example of an HTML document.

Code Listing 1.1. A Simple HTML Example
<HTML>
    <HEAD>
        <TITLE> Java Programming Tutorial </TITLE>
    </HEAD>
    <BODY>
        <!-- Created 2/8/1999 -->
        <UL>
            <LI> <A HREF="chap1.html">Chapter One </A>
            <LI> <A HREF="chap2.html">Chapter Two </A>
        </UL>
    </BODY>
</HTML>

Table 1.2 highlights the key HTML components from Listing 1.1.

Table 1.2. HTML Components
HTML Component                                Description
<BODY>                                        Start tag
</BODY>                                       End tag
<TITLE> Java Programming Tutorial </TITLE>    Element
HREF="chap1.html"                             Attribute (inside the <A> start tag)
<!-- Created 2/8/1999 -->                     Comment
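
The components in Table 1.2 can also be picked out mechanically. The following is a minimal sketch, assuming Python and its standard html.parser module, neither of which this chapter uses; the class name ComponentCollector and the variable LISTING_1_1 are hypothetical names chosen for illustration. Note that this parser reports tag and attribute names in lowercase.

```python
from html.parser import HTMLParser

# Listing 1.1, reproduced as a string (the <LI> elements have no end tags).
LISTING_1_1 = """<HTML>
    <HEAD>
        <TITLE> Java Programming Tutorial </TITLE>
    </HEAD>
    <BODY>
        <!-- Created 2/8/1999 -->
        <UL>
        <LI> <A HREF="chap1.html">Chapter One </A>
        <LI> <A HREF="chap2.html">Chapter Two </A>
        </UL>
    </BODY>
</HTML>"""


class ComponentCollector(HTMLParser):
    """Hypothetical helper that records each HTML component it encounters."""

    def __init__(self):
        super().__init__()
        self.components = []

    def handle_starttag(self, tag, attrs):
        self.components.append(("start tag", tag))
        # Attributes arrive as (name, value) pairs attached to the start tag.
        for name, value in attrs:
            self.components.append(("attribute", f'{name}="{value}"'))

    def handle_endtag(self, tag):
        self.components.append(("end tag", tag))

    def handle_comment(self, data):
        self.components.append(("comment", data.strip()))


collector = ComponentCollector()
collector.feed(LISTING_1_1)
collector.close()

for kind, text in collector.components:
    print(f"{kind}: {text}")
```

Because the list items in Listing 1.1 are never closed, the sketch records start tags for li but no matching end tags.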

Tim Berners-Lee and his team at CERN (the European Organization for Nuclear Research, whose acronym comes from Conseil Européen pour la Recherche Nucléaire) also created the Hypertext Transfer Protocol (HTTP), the Web browser, and the Web server. In February of 1993, while a student at the University of Illinois and working at the National Center for Supercomputing Applications (NCSA), Marc Andreessen (and a small team of peers) created a graphical Web browser called Mosaic. Although Mosaic was not the first graphical Web browser, it spread widely because the NCSA team ported it to the three most popular platforms (Windows, Mac, and Unix) and gave it away for free. In 1994, Jim Clark (founder of Silicon Graphics, Inc.) and Marc Andreessen co-founded Netscape Communications (originally called Mosaic Communications Corp). Soon after, they released the Netscape Navigator Web browser.

As the Web grew in popularity, HTML was extended for new purposes; however, it soon became apparent that proprietary extensions to HTML were counterproductive and ill-suited to general use. So, to solve the problems of interoperability and scalability on the Web without extending HTML, the W3C began work on a simplified version of SGML, which it called the Extensible Markup Language (XML).

Some people attempt to compare HTML and XML qualitatively because they see XML as a replacement for HTML. XML is not intended as a replacement for HTML; the two are complementary technologies. XML is a more general and better solution to the problem of sharing data on the Web than extending HTML. Extending a single language to cover every possible case is impossible. Unique cases require unique languages, which is why each vertical industry quickly develops its own specialized jargon. The data in those vertical domains can be structured and captured by an XML-compliant markup language and presented via HTML or the Extensible Stylesheet Language (XSL), covered in Chapter 5.

The Extensible Markup Language (XML) 1.0 became a World Wide Web Consortium (W3C) recommendation on February 10, 1998. A recommendation is the final step in the W3C process for creating Web standards. As stated earlier, XML is a subset of SGML that specifies syntactic and semantic rules and constraints for creating new markup languages. Although it is currently common practice to describe documents that follow the XML standard as "XML documents," that description is too generic. What you actually create are XHTML (HTML reformulated in XML), MathML (Mathematical Markup Language), or CML (Chemical Markup Language) documents, written in languages designed using the rules laid out in the XML specification. The term "XML document" is therefore an abstract concept that has no concrete implementation. Listing 1.2 is an example of an Address Book Markup Language (ABML) document. ABML is an XML-compliant language.

Code Listing 1.2. ABML Document
<?xml version="1.0"?>

<!DOCTYPE ADDRESS_BOOK SYSTEM "abml.dtd">
<ADDRESS_BOOK>
    <ADDRESS>
        <NAME>Michael Daconta </NAME>
        <STREET>4296 Razor Hill Road </STREET>
        <CITY>Bealeton </CITY>
        <STATE>VA </STATE>
        <ZIP>22712 </ZIP>
    </ADDRESS>
    <ADDRESS>
        <NAME>Sterling Software </NAME>
        <STREET> 7900 Sudley Road</STREET>
        <STREET> Suite 500</STREET>
        <CITY>Manassas</CITY>
        <STATE>VA </STATE>
        <ZIP>20109 </ZIP>
    </ADDRESS>
</ADDRESS_BOOK>
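
To connect Listing 1.2 back to Table 1.1, the tags of an ABML document can be read as meta-data and the text they enclose as data. The following is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module, which are not part of this chapter; the function name address_pairs is hypothetical. The sketch omits the XML declaration and the DOCTYPE line, so it only checks that the document is well formed and does not validate it against abml.dtd.

```python
import xml.etree.ElementTree as ET

# The first address from Listing 1.2, without the declaration and DOCTYPE.
ABML = """<ADDRESS_BOOK>
    <ADDRESS>
        <NAME>Michael Daconta </NAME>
        <STREET>4296 Razor Hill Road </STREET>
        <CITY>Bealeton </CITY>
        <STATE>VA </STATE>
        <ZIP>22712 </ZIP>
    </ADDRESS>
</ADDRESS_BOOK>"""


def address_pairs(document):
    """Return one list of (meta-data, data) pairs per ADDRESS element."""
    root = ET.fromstring(document)
    return [
        # The tag name is the meta-data; the enclosed text is the data.
        [(field.tag, (field.text or "").strip()) for field in address]
        for address in root.findall("ADDRESS")
    ]


for address in address_pairs(ABML):
    for label, value in address:
        print(f"{label}: {value}")
```

The sketch strips the trailing spaces from each value only for display; the parser itself hands them to the application unchanged.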

Because HTML predated XML, it is not currently XML compliant; however, the next version of HTML from the W3C will be XML compliant. So, let's examine some of the differences between HTML and XML. By comparing Listing 1.1 and Listing 1.2, you can see that the two share their major constructs, such as start and end tags. Comments and attributes are also written identically. The differences make XML the stricter specification. Here is a list of the differences:

Note

The W3C has released a Recommendation of HTML reformulated as an XML 1.0-compliant language and called XHTML 1.0. See http://www.w3.org/tr/xhtml1.


  • XML elements (composed of a start and end tag) must be strictly nested. For example, <B><I> Improper Nesting </B></I> is tolerated by HTML browsers but is illegal in XML. The proper nesting is for the <I> element to be fully enclosed within the <B> element, like this: <B><I> Proper Nesting </I></B>.

  • In XML, every start tag must have an end tag. In HTML, by contrast, the list item (<LI>) and paragraph (<P>) tags are not required to have an end tag. In HTML, this is legal: <LI> Platoon Leader, US Army. That statement would be illegal in XML. In XML, you would write <LI> Platoon Leader, US Army </LI>.

  • Both HTML and XML allow empty tags. An empty tag does not have any content associated with it. For example, the HTML image tag is <IMG SRC="portrait.jpg">. However, in XML an empty tag must have a forward slash before the closing greater-than character, like this: <IMG SRC="portrait.jpg" />. You can think of this as combining a start and end tag.

  • XML documents allow only one root element. A root element is the top-level element that contains all other elements. For example, in Listing 1.1, <HTML> is the root element. While a single root element is the norm in HTML, the XML specification makes it a formal requirement.

  • All attribute values in XML must be surrounded by single or double quotes. In HTML, a single-word value does not require quotes. So, where in HTML you can have <PARAM value=10>, in XML it must be <PARAM value="10">.

  • XML tags are case sensitive, whereas HTML tags are not. In HTML, <BODY> and <body> are the same tag. In XML, they are two different tags.

  • Whitespace between tags is ignored in HTML, but in XML it is preserved, considered relevant, and passed on by the XML parser to the processing application.
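
The rules in this list can be checked with any XML parser. The following is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module, which this chapter does not otherwise use; the function name is_well_formed and the EXAMPLES dictionary are hypothetical. Each fragment is taken from, or modeled on, the examples in the list above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as a well-formed XML document."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Pairs of fragments: one that breaks a rule and one that follows it.
EXAMPLES = {
    "improper nesting": "<B><I> Improper Nesting </B></I>",
    "proper nesting": "<B><I> Proper Nesting </I></B>",
    "missing end tag": "<UL><LI> Platoon Leader, US Army </UL>",
    "explicit end tag": "<UL><LI> Platoon Leader, US Army </LI></UL>",
    "unclosed empty tag": '<IMG SRC="portrait.jpg">',
    "closed empty tag": '<IMG SRC="portrait.jpg" />',
    "two root elements": "<A></A><B></B>",
    "unquoted attribute": "<PARAM value=10 />",
    "quoted attribute": '<PARAM value="10" />',
    "mismatched case": "<BODY></body>",
}

results = {name: is_well_formed(text) for name, text in EXAMPLES.items()}
for name, ok in results.items():
    print(f"{name}: {'well formed' if ok else 'rejected'}")

# Whitespace inside an element is handed to the application unchanged.
street = ET.fromstring("<STREET> 7900 Sudley Road</STREET>")
print(repr(street.text))
```

A check like this only tests well-formedness; validating a document against a DTD is a separate step that this parser does not perform.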
