Handling XML in Tcl

Tip

This section assumes that the reader is familiar with terms such as XML, DOM, SAX, and XPath.

Like almost every modern programming language, Tcl offers a comprehensive set of tools for working with XML. Tcl is well suited to XML thanks to its native support for Unicode. There are many extensions for handling XML documents, but one of the most important is undoubtedly tDOM (http://www.tdom.org). Because this extension is implemented in C, it is fast and efficient, with reasonable memory consumption; it supports XPath, XSLT, and optional DTD validation. From a practical point of view it is also easy to obtain, as it is included in the ActiveTcl distribution.

The name of this extension indicates that it works as an XML Document Object Model (DOM) parser. In other words, the document is treated as a tree-like structure. While we will focus on DOM, it is worth noting that tDOM can also act as an event-driven SAX parser.
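To give a rough idea of the event-driven mode, here is a minimal sketch that uses the expat command provided by the tdom package. The callback names onStart and onText are our own, chosen for illustration, and the events are collected in a global list so they can be inspected after parsing.

```tcl
package require tdom

set events {}

# onStart and onText are hypothetical callback names chosen for this sketch.
proc onStart {name attributes} {
    # attributes is a flat list of attribute names and values
    lappend ::events [list start $name $attributes]
}
proc onText {data} {
    lappend ::events [list text $data]
}

set parser [expat -elementstartcommand onStart -characterdatacommand onText]
$parser parse {<book isbn="978-1-849510-96-7"><title>Tcl Network Programming</title></book>}
$parser free

foreach event $events {
    puts $event
}
```

No tree is built in this mode, which may matter for documents too large to hold in memory.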

Throughout this section, we will use the following example XML document, which describes this book:

<?xml version="1.0" encoding="UTF-8"?>
<books>
<book isbn="978-1-849510-96-7">
<title>Tcl Network Programming</title>
<authors>
<author>Piotr Bełtowski</author>
<author>Wojciech Kocjan</author>
</authors>
</book>
</books>

The document is saved in the file test.xml. We do not intend to replicate the comprehensive tDOM manual available on the website, but rather to give a taste of using tDOM.

To start working with tDOM, we have to load the package first:

package require tdom

Next, we parse the XML document with the dom parse command, which creates a complete DOM tree in memory:

set channel [open test.xml]
fconfigure $channel -encoding utf-8
set doc [dom parse [read $channel]]
close $channel

As the test.xml file is encoded in UTF-8, the input channel must be configured accordingly. The basic syntax of the parse command is dom parse $xml, where xml is the variable that holds the entire XML document. The command can also read directly from a channel (the encoding must still be configured correctly):

set doc [dom parse -channel $channel]

The command returns a Tcl command object (stored in doc) that provides a wide set of methods for interacting with the DOM document object created by the parse action. The methods are called according to the pattern: $doc methodName ?arg ...?

There is a convenient wrapper command, tDOM::xmlReadFile, that automatically handles the encoding and chores such as closing the channel. The previous code sample can be expressed in one line:

set doc [dom parse [tDOM::xmlReadFile test.xml]]

As mentioned before, the DOM object is a tree-like structure, and its basic unit is a node. Essentially, every XML tag in the document is mapped to a corresponding node. By calling the appropriate commands on a DOM object (referred to via $doc) you can obtain node command objects that allow direct interaction with nodes (again through an appropriate set of methods). For example, let's inspect the authors of the book:

set authors [$doc getElementsByTagName authors]
puts "<authors> has child nodes: [$authors hasChildNodes]"
set author [$authors firstChild]
puts [[$author firstChild] nodeValue]
set author [$author nextSibling]
puts [[$author firstChild] nodeValue]

First, we retrieve all nodes named authors. In this case there is only one such element, but in general the command $doc getElementsByTagName tagName returns a Tcl list of elements. From this point on, $authors refers to the <authors> node, and we can call node methods on it:

  • hasChildNodes returns the boolean value 0 or 1 depending on whether the node has child nodes
  • firstChild returns the first child node of the node the method is executed on
  • nodeValue returns the value of the node
  • nextSibling returns the next sibling node to the node the method is executed on

It is important to note that the first $author value corresponds to <author>Piotr Bełtowski</author>, and [$author firstChild] returns the text node inside it, which contains the text "Piotr Bełtowski". Knowing all that, it is not a surprise that the output thus far is:

<authors> has child nodes: 1
Piotr Bełtowski
Wojciech Kocjan
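The relationship between element nodes and the text nodes inside them can be made visible with a small recursive walk. The sketch below is only an illustration: describe is a hypothetical helper of our own, built from the childNodes, nodeType, nodeName and nodeValue methods, and it treats every node that is not a text node as an element.

```tcl
package require tdom

# describe is a hypothetical helper: it returns one indented line per node.
proc describe {node {depth 0}} {
    set indent [string repeat "  " $depth]
    if {[$node nodeType] eq "TEXT_NODE"} {
        return [list "$indent\"[$node nodeValue]\""]
    }
    set lines [list "$indent<[$node nodeName]>"]
    foreach child [$node childNodes] {
        lappend lines {*}[describe $child [expr {$depth + 1}]]
    }
    return $lines
}

set doc [dom parse {<authors><author>Wojciech Kocjan</author></authors>}]
set lines [describe [$doc documentElement]]
puts [join $lines \n]
$doc delete
```

The text appears one level below its <author> element, which is why the earlier listing needs the extra firstChild call before nodeValue.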

The method nodeValue allows us not only to get the value, but also to set a new one:

[[$authors firstChild] firstChild] nodeValue "John Smith"

To verify that the value has been changed, let's print the authors again but in a different way:

foreach author [$authors childNodes] {
    puts [[$author firstChild] nodeValue]
}

As expected, the output is:

John Smith
Wojciech Kocjan

The method childNodes returns a Tcl list of child nodes, so it is well suited to iteration with foreach.

The node can be deleted:

[$authors lastChild] delete

So now the list of authors will be shorter:

set authors [$doc selectNodes /books/book/authors]
foreach author [$authors childNodes] {
    puts [[$author firstChild] nodeValue]
}

This time, in the command $doc selectNodes xpathQuery, we used the XPath expression /books/book/authors to address all <authors> elements (in this case there is only one).

Describing the XPath syntax is outside the scope of this book, but here are some examples as a quick introduction:

XPath expression    Result
/                   Selects the root node
/tag                Selects the root element named "tag"
/tag1/tag2          Selects all "tag2" elements that are direct children of the root element "tag1"
//tag               Selects all "tag" elements, regardless of their position in the document
/tag/text()         Selects the text content of the root element "tag"; in this case text nodes are selected rather than the elements themselves

For more information visit http://en.wikipedia.org/wiki/XPath or read the tutorial at http://www.w3schools.com/XPath/xpath_syntax.asp.

The same result can be achieved more easily by directly addressing the appropriate text nodes:

set authorNames [$doc selectNodes /books/book/authors/author/text()]
foreach authorName $authorNames {
    puts [$authorName nodeValue]
}

In both cases, the output is the same:

John Smith

XPath syntax is used unchanged in tDOM, with one caveat: square brackets [] have a special meaning in Tcl (command substitution), so an expression that contains them must be enclosed in braces or have its brackets escaped.
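The bracket issue can be sketched as follows. The document is parsed from an inline string so that the example is self-contained, and the variable names are ours.

```tcl
package require tdom

set xml {<books><book isbn="978-1-849510-96-7"><title>Tcl Network Programming</title><authors><author>Piotr Bełtowski</author><author>Wojciech Kocjan</author></authors></book></books>}
set doc [dom parse $xml]

# Braces stop Tcl from treating [1] as command substitution.
set first [$doc selectNodes {/books/book/authors/author[1]/text()}]
set firstName [$first nodeValue]

# Backslash-escaping the brackets also works, but is harder to read.
set second [$doc selectNodes /books/book/authors/author\[2\]/text()]
set secondName [$second nodeValue]

# A predicate on an attribute; string() makes selectNodes return plain text.
set title [$doc selectNodes {string(/books/book[@isbn='978-1-849510-96-7']/title)}]

puts $firstName
puts $secondName
puts $title
$doc delete
```

Bracing the whole expression is usually the simpler habit, since it also protects any $ characters from variable substitution.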

Thus far we have learned how to modify or remove nodes, so let's add some new authors:

set element [$doc createElement author]
$element appendChild [$doc createTextNode "Jean-Luc Picard"]
$authors appendChild $element

The method $doc createElement elementName creates an element of the given name; in this case the name is "author". Next, we create a text node with the content "Jean-Luc Picard" using the $doc createTextNode text method. The text node is appended to the author element, and the element is finally added as a child of <authors>.

The tDOM package offers many ways to achieve the same result. For example, the author could be added in a much shorter, though less elegant, way:

$authors appendXML "<author>James T. Kirk</author>"

appendXML takes a raw XML string as an input argument, parses it and creates an appropriate DOM sub-tree that is subsequently merged into the main document tree.

To verify that the authors were added we could again print the list of names, but this time let's convert the entire DOM tree back to an XML document using the asXML method:

puts [$doc asXML]

And the result is indeed:

<books>
<book isbn="978-1-849510-96-7">
<title>Tcl Network Programming</title>
<authors>
<author>John Smith</author>
<author>Jean-Luc Picard</author>
<author>James T. Kirk</author>
</authors>
</book>
</books>

XML elements can have not only a value but also attributes. The corresponding node command object offers the following methods to handle them:

  • $node attributes—returns the list of all attributes existing for node $node
  • $node hasAttribute attributeName—returns the Boolean value 0 or 1 depending on whether the attribute of name attributeName exists for the node $node
  • $node getAttribute attributeName ?defaultValue?—returns the attribute value. The command will fail if the attribute does not exist, unless it is provided with a defaultValue that would be returned in this case.
  • $node setAttribute attributeName value—sets the new value of the attribute. This command will create the attribute if it does not already exist.
  • $node removeAttribute attributeName—removes the attribute.

Knowing this, we can experiment with the book's attributes:

set book [$doc selectNodes /books/book]
if {[$book hasAttribute isbn]} {
    puts [$book getAttribute isbn]
}
puts [$book getAttribute notExisting "attribute not defined!"]
$book setAttribute year 2010

First the code will print out the value of the isbn attribute, and then it will attempt to get the value of the notExisting attribute, resulting in the output:

978-1-849510-96-7
attribute not defined!

Finally, it creates a new year attribute, so puts [$doc asXML] will now write:

<books>
<book isbn="978-1-849510-96-7" year="2010">
<title>Tcl Network Programming</title>
<authors>
<author>John Smith</author>
<author>Jean-Luc Picard</author>
<author>James T. Kirk</author>
</authors>
</book>
</books>
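The two methods not exercised above, attributes and removeAttribute, could be tried out as in the following sketch. It works on a small inline element rather than on test.xml, and the variable names are ours.

```tcl
package require tdom

set doc [dom parse {<book isbn="978-1-849510-96-7" year="2010"/>}]
set book [$doc documentElement]

# Names of all attributes, sorted so the order is predictable
set names [lsort [$book attributes]]
puts $names

# Remove one attribute and confirm that it is gone
$book removeAttribute year
set hasYear [$book hasAttribute year]
puts $hasYear

# asXML is also available on a single node
puts [$book asXML]
$doc delete
```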

Conserving memory may be a goal, particularly with large XML documents, so once we finish working with the DOM document object, we should always delete it to free the memory it was using:

$doc delete
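One way to make sure the document is deleted even when an error occurs is to wrap the work in try ... finally. This is only a sketch: it assumes Tcl 8.6 or later (where the try command is available), and countAuthors is a hypothetical helper of our own.

```tcl
package require tdom

# countAuthors is a hypothetical helper used only for this illustration.
proc countAuthors {xml} {
    set doc [dom parse $xml]
    try {
        return [llength [$doc selectNodes //author]]
    } finally {
        # Runs on normal return and on error alike.
        $doc delete
    }
}

set count [countAuthors {<authors><author>A</author><author>B</author></authors>}]
puts $count
```

On older Tcl versions a similar effect could be achieved with catch, at the cost of more verbose code.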

Until now we have been working on an existing XML document, but we already know how to create nodes programmatically. The following example shows how to build the test.xml file from scratch:

package require tdom
set doc [dom createDocument books]
set root [$doc documentElement]
set book [$doc createElement book]
$book setAttribute isbn "978-1-849510-96-7"
set title [$doc createElement title]
$title appendChild [$doc createTextNode "Tcl Network Programming"]
$book appendChild $title
set authors [$doc createElement authors]
$book appendChild $authors
set authorNames [list "Piotr Bełtowski" "Wojciech Kocjan"]
foreach authorName $authorNames {
    set author [$doc createElement author]
    $author appendChild [$doc createTextNode $authorName]
    $authors appendChild $author
}
$root appendChild $book
set channel [open test.xml w]
fconfigure $channel -encoding utf-8
$doc asXML -channel $channel
close $channel

The code is fairly self-explanatory. Like the parse method, asXML can also write its output to the channel specified with -channel. Looking at this code sample, you have probably noticed that it is hard to tell at a glance what the produced XML will be. Luckily, there is a far more legible alternative:

package require tdom
set doc [dom createDocument books]
set root [$doc documentElement]
dom createNodeCmd elementNode book
dom createNodeCmd elementNode title
dom createNodeCmd elementNode authors
dom createNodeCmd elementNode author
dom createNodeCmd textNode text
$root appendFromScript {
    book -isbn "978-1-849510-96-7" {
        title {text "Tcl Network Programming"}
        authors {
            author {text "Piotr Bełtowski"}
            author {text "Wojciech Kocjan"}
        }
    }
}
puts [$doc asXML]

Here we use the dom createNodeCmd nodeType commandName command. As a result, a special Tcl command named commandName is defined. When used, such a command generates a DOM node of type nodeType, named after commandName. The most common node types are:

  • elementNode: creates a normal element node
  • textNode: creates a text node
  • commentNode: creates a comment node

The created command cannot be used just anywhere in the code, but only in a script supplied to the $node appendFromScript script method.

Invoking a created element node command is simple: it takes zero or more pairs of attribute names and values (the attribute name may, but need not, be preceded by a '-' character) and an optional Tcl script that creates the node's content. That script must be in the format accepted by the appendFromScript method, as it is evaluated recursively. Commands generating text or comment nodes are simpler, as they accept only the text data to be inserted into the node.

For example, a call to author {text "Wojciech Kocjan"} will cause the creation of a text node that is appended as a child of the <author> node, resulting in <author>Wojciech Kocjan</author>.

The result of both code samples described above is identical:

<books>
<book isbn="978-1-849510-96-7">
<title>Tcl Network Programming</title>
<authors>
<author>Piotr Bełtowski</author>
<author>Wojciech Kocjan</author>
</authors>
</book>
</books>

But the readability of the second example is better, especially when correct indentation is preserved in the script argument of appendFromScript.
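The remaining details of node commands described earlier, namely comment nodes and attribute names written without the leading dash, could be sketched like this. The comment text and the year attribute are made up for the illustration, and the command name comment is our own choice.

```tcl
package require tdom

set doc [dom createDocument books]
set root [$doc documentElement]

dom createNodeCmd elementNode book
dom createNodeCmd elementNode title
dom createNodeCmd textNode text
dom createNodeCmd commentNode comment

$root appendFromScript {
    comment "generated with appendFromScript"
    # Attribute names are given here once with and once without the dash.
    book -isbn "978-1-849510-96-7" year 2010 {
        title {text "Tcl Network Programming"}
    }
}

set xml [$doc asXML]
puts $xml
$doc delete
```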

tDOM is also capable of parsing HTML documents (which often do not comply with the XML specification). To do so, pass the -html option to the parse command. For example, the HTML returned by www.google.com in Poland is rather simple:

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="http://www.google.pl/">here</A>.
</BODY></HTML>

And using the following code (the http package is described in Chapter 7):

package require tdom
package require http
set html [http::data [http::geturl "http://www.google.com"]]
set doc [dom parse -html $html]
puts [$doc asXML]

We are able to parse it, effectively obtaining the following XML document:

<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=utf-8"/>
<title>302 Moved</title>
</head>
<body>
<h1>302 Moved</h1>
The document has moved
<a href="http://www.google.pl/">here</a>
.
</body>
</html>

An attempt to do the same without the -html option would cause an error:

error "mismatched tag" at line 2 character 26
">302 Moved</TITLE></H <--Error-- EAD><BODY>
<H1>302 Moved</H1>
The docum"
while executing
"dom parse $html"
invoked from within
"set doc [dom parse $html]"
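When it is not known in advance whether the input is well-formed XML, one possible approach is to try strict parsing first and fall back to HTML mode if it fails. The following is a sketch under that assumption; parseLenient is a hypothetical helper, and the markup is an inline simplification of the page shown above.

```tcl
package require tdom

# parseLenient is a hypothetical helper: strict XML first, HTML mode as fallback.
proc parseLenient {markup} {
    if {[catch {dom parse $markup} doc]} {
        # Strict parsing failed; $doc holds the error message, so parse again.
        set doc [dom parse -html $markup]
    }
    return $doc
}

set html {<html><head><meta http-equiv="content-type" content="text/html"><title>302 Moved</title></head><body><a href="http://www.google.pl/">here</a>.</body></html>}
set doc [parseLenient $html]

set links {}
foreach node [$doc selectNodes {//a[@href]}] {
    lappend links [$node getAttribute href]
}
puts $links
$doc delete
```

Whether a silent fallback is appropriate depends on the application; in some cases reporting the original parse error may be preferable.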

The ability to parse HTML can be really helpful in network programming: you can easily imagine a situation where you want to grab some information from a web page. A combination of tDOM and appropriate XPath queries can work wonders while the code stays brief. The following example displays the links returned from a search for the keyword 'tDOM' using the Bing search engine:

package require tdom
package require http
set token [http::geturl "http://www.bing.com/search?q=tDOM"]
set html [http::data $token]
http::cleanup $token
set doc [dom parse -html $html]
foreach node [$doc selectNodes {//div/h3/a[@href]}] {
    puts [$node getAttribute href]
}

At the time of writing, the output was:

http://www.tdom.org/
http://tdom.com/
http://www.tdom.org/domDoc.html
http://www.phpclasses.org/browse/package/5690.html
http://packages.qa.debian.org/t/tdom.html
http://acronyms.thefreedictionary.com/TDOM
http://groups.yahoo.com/group/tdom/
http://www.ohloh.net/p/tdom
http://packages.debian.org/tdom
http://packages.debian.org/unstable/interpreters/tdom

As we have only outlined the usage of the tDOM package, for more details please consult the manual (http://www.tdom.org) and the freely available examples on the Web, especially on the Tcl wiki (http://wiki.tcl.tk).

It is worth noting that there is another popular package for processing XML documents, named TclXML (http://tclxml.sourceforge.net). The functionality it offers is comparable to tDOM's. Some parts of TclXML are written in pure Tcl, which may be an advantage on platforms where the C-based tDOM is not available. On the other hand, tDOM offers superior performance and lower memory consumption, advantages that should not be underestimated.
