Working with XML Documents

To be fully accessible, an XML document must be entirely loaded in memory and its nodes and attributes mapped to corresponding objects derived from the XmlNode class. The process that builds the XML DOM triggers when you call the Load method. You can use a variety of sources to indicate the XML document to work on, including disk files, URLs, streams, and text readers.

Loading XML Documents

Unless you pass a reader yourself, the Load method transforms the data source into an XmlTextReader object and passes it down to an internal loader object. The method has the following overloads:

public virtual void Load(Stream);
public virtual void Load(string);
public virtual void Load(TextReader);
public virtual void Load(XmlReader);

The loader is responsible for reading all the nodes in the document and does that through a nonvalidating reader. After a node has been read, it is analyzed and the corresponding XmlNode object created and added to the document tree. The entire process is illustrated in Figure 5-4.

Figure 5-4. The loading process of an XmlDocument object.


Note that before a new XmlDocument object is loaded, the current instance of the XmlDocument object is cleared. This means that if you reuse the same instance of the XmlDocument class to load a second document, the existing contents are entirely removed before proceeding.
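
To make the clearing behavior concrete, here is a minimal sketch that loads two documents in sequence into the same XmlDocument instance. The class and method names (ReloadDemo, LoadTwice) are ours, chosen only for illustration, and the sketch uses the Load(TextReader) overload so that it needs no files on disk.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ReloadDemo
{
    // Loads two documents, one after the other, into the same XmlDocument
    // and returns the root element name seen after each load.
    public static string[] LoadTwice(string firstXml, string secondXml)
    {
        XmlDocument doc = new XmlDocument();

        doc.Load(new StringReader(firstXml));    // Load(TextReader) overload
        string first = doc.DocumentElement.Name;

        doc.Load(new StringReader(secondXml));   // the first tree is discarded here
        string second = doc.DocumentElement.Name;

        return new string[] { first, second };
    }

    public static void Main()
    {
        string[] roots = LoadTwice("<a><x/></a>", "<b/>");
        Console.WriteLine("{0}, then {1}", roots[0], roots[1]);
    }
}
```

After the second call, nothing of the first document is left in the instance; if you need both trees at once, create two XmlDocument objects.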

Important

Although an XML reader is always used to build an XML DOM, some differences can be noticed when the reader is built internally—that is, you call Load on a file or a stream—or explicitly passed by the programmer. In the latter case, if the reader is already positioned on a nonroot node, only the siblings of that node are read and added to the DOM. If the current reader’s node can’t be used as the root of a document (for example, attributes or processing instructions), the reader reads on until it finds a node that can be used as the root. Pay attention to the state of the reader before you pass it on to the XML DOM loader.
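
The following sketch shows what happens when the reader is already positioned on a nonroot node. It is an illustration under our own assumptions: the names PositionedReaderDemo and LoadFromItem are hypothetical, the reader is created with XmlReader.Create rather than as an XmlTextReader, and the sample markup contains a single item element so that the loaded fragment has exactly one root.

```csharp
using System;
using System.IO;
using System.Xml;

public static class PositionedReaderDemo
{
    // Moves the reader to the first <item> element before handing it to
    // Load, then reports which element became the document root.
    public static string LoadFromItem(string xml)
    {
        XmlDocument doc = new XmlDocument();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.ReadToFollowing("item");   // the reader now sits on <item>
            doc.Load(reader);                 // loading starts at <item>, not at the original root
        }
        return doc.DocumentElement.Name;
    }

    public static void Main()
    {
        Console.WriteLine(LoadFromItem("<root><item>only</item></root>"));
    }
}
```

If the node under the reader had further element siblings, the loader would find more than one candidate root and the load would fail, which is why checking the reader's position first matters.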


Let’s see how to use the XML DOM to build a relatively simple example—the same code that we saw in action in Chapter 2 with readers. The following code parses the contents of an XML document and outputs its element node layout, discarding everything else, including text, attributes, and other nonelement nodes:

using System;
using System.Xml;

class XmlDomLayoutApp
{
    public static void Main(String[] args)
    {
        try {
            String fileName = args[0];
            XmlDocument doc = new XmlDocument();
            doc.Load(fileName);
            XmlElement root = doc.DocumentElement;
            LoopThroughChildren(root);
        }
        catch (Exception e) {
            Console.WriteLine("Error:\t{0}\n", e.Message);
        } 

        return;
    }

    private static void LoopThroughChildren(XmlNode root)
    {
        Console.WriteLine("<{0}>", root.Name);
        foreach(XmlNode n in root.ChildNodes)
        {
            if (n.NodeType == XmlNodeType.Element)
                LoopThroughChildren(n);
        }
        Console.WriteLine("</{0}>", root.Name);
    }
}

After creating the XML DOM, the program begins a recursive visit that touches every element node in the document and skips nodes of any other type. The ChildNodes list returns only the first-level children of a given node. Of course, this is not enough to traverse the tree from the root to the leaves, so the LoopThroughChildren method is recursively called on each element node found. Let’s call the program to work on the following XML file:

<platforms type="software">
  <platform vendor="Microsoft">.NET</platform>
  <platform vendor="" OpenSource="yes">Linux</platform>
  <platform vendor="Microsoft">Win32</platform>
  <platform vendor="Sun">Java</platform>
</platforms>

The result we get using the XML DOM is shown here and is identical to what we got from readers in Chapter 2:

<platforms>
<platform></platform>
<platform></platform>
<platform></platform>
<platform></platform>
</platforms>

Well-Formedness and Validation

The XML document loader checks input data only for well-formedness. If parsing errors are found, an XmlException exception is thrown and the resulting XmlDocument object remains empty. To load a document and validate it against a DTD or a schema file, you must use the overload of the Load method that accepts an XmlReader object. You pass the Load method a properly initialized instance of the XmlValidatingReader class, as shown in the following code, and proceed as usual:

XmlTextReader _coreReader;
XmlValidatingReader reader;
_coreReader = new XmlTextReader(xmlFile);
reader = new XmlValidatingReader(_coreReader);
doc.Load(reader);

Any schema information found in the file is taken into account and the contents are validated. Validation errors, if any, are passed on to the validation handler you might have defined. (See Chapter 3 for more details on the working of .NET Framework validating readers.) If your validating reader does not have an event handler, the first validation error raises an exception that stops the loading. Otherwise, the operation continues unless the handler itself throws an exception.
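
As a sketch of the handler-based flow, the following code validates a document against an inline DTD while it is being loaded. Note that it swaps in a different technique from the listing above: later versions of the .NET Framework mark XmlValidatingReader as obsolete, so this sketch configures validation through XmlReaderSettings instead. The names ValidatingLoadDemo and LoadWithDtdValidation, and the sample markup, are ours.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class ValidatingLoadDemo
{
    // Loads xml into doc with DTD validation turned on and returns the
    // validation messages collected by the handler.
    public static List<string> LoadWithDtdValidation(string xml, XmlDocument doc)
    {
        List<string> problems = new List<string>();

        XmlReaderSettings settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Parse;
        settings.ValidationType = ValidationType.DTD;

        // Because a handler is registered, validation errors are reported
        // here instead of being thrown, and loading continues.
        settings.ValidationEventHandler +=
            delegate(object sender, ValidationEventArgs e) { problems.Add(e.Message); };

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            doc.Load(reader);
        }
        return problems;
    }

    public static void Main()
    {
        string dtd = "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]>";
        XmlDocument doc = new XmlDocument();
        List<string> problems = LoadWithDtdValidation(dtd + "<note><from>Bob</from></note>", doc);
        Console.WriteLine("{0} validation problem(s) reported", problems.Count);
    }
}
```

Collecting the messages in a list, rather than throwing from the handler, lets the caller decide after loading whether the document is still usable.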

Loading from a String

The XML DOM programming interface also provides you with a method to build a DOM from a well-formed XML string. The method is LoadXml and is shown here:

public virtual void LoadXml(string xml);

This method does not support validation and, by default, does not preserve white space. Any context-specific information you might need (DTD, entities, namespaces) must necessarily be embedded in the string to be taken into account.
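
Here is a small sketch of LoadXml at work. It parses a string, reports how many child nodes end up under the root, and returns false when the string is not well-formed. The LoadXmlDemo class and the TryLoad method are hypothetical names used for this example only.

```csharp
using System;
using System.Xml;

public static class LoadXmlDemo
{
    // Parses the string and reports the number of first-level children of
    // the root. Returns false if the string is not well-formed XML.
    public static bool TryLoad(string xml, out int childCount)
    {
        childCount = 0;
        XmlDocument doc = new XmlDocument();
        try
        {
            doc.LoadXml(xml);
        }
        catch (XmlException)
        {
            return false;
        }
        childCount = doc.DocumentElement.ChildNodes.Count;
        return true;
    }

    public static void Main()
    {
        int count;
        // The blanks around <b/> are not turned into nodes by default
        Console.WriteLine("{0} {1}", TryLoad("<a> <b/> </a>", out count), count);
        // Mismatched tags: LoadXml throws XmlException
        Console.WriteLine(TryLoad("<a><b></a>", out count));
    }
}
```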

Loading Documents Asynchronously

The .NET Framework implementation of the XML DOM does not provide for asynchronous loading. The Load method, in fact, always works synchronously and does not pass control back to the caller until it has completed. As you might guess, this can become a serious problem when you have huge files to process and a rich user interface.

In similar situations—that is, when you are writing a Windows Forms rich client—using threads can be the most effective solution. You transfer to a worker thread the burden of loading the XML document and update the user interface when the thread returns, as shown here:

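// Both methods are assumed to be members of the form that hosts the controls
string inputFileName;

void StartDocumentLoading()
{
    // Read the file name on the UI thread, before the worker starts
    inputFileName = InputFile.Text;

    // Create the worker thread
    Thread t = new Thread(new ThreadStart(this.LoadXmlDocument));

    statusBar.Text = "Loading document...";
    t.Start();
}

void LoadXmlDocument()
{
    XmlDocument doc = new XmlDocument();
    doc.Load(inputFileName);

    // Update the user interface on the UI thread; Windows Forms controls
    // should not be touched directly from a worker thread
    this.Invoke((MethodInvoker) delegate {
        statusBar.Text = "Document loaded.";
        Output.Text = doc.OuterXml;
        Output.ReadOnly = false;
    });
}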

While the secondary thread works, the user can freely use the application’s user interface and the huge size of the XML file is no longer a serious issue—at least as it pertains to loading.
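
The same idea can be sketched without any user interface. The following example is not the approach used in the listing above: it relies on the Task class, which was added to the .NET Framework in a later version, to run the load on a background thread and hand the finished document back to the caller. The names BackgroundLoadDemo and LoadInBackground are ours.

```csharp
using System;
using System.Threading.Tasks;
using System.Xml;

public static class BackgroundLoadDemo
{
    // Starts parsing on a thread-pool thread and returns immediately.
    // The task completes when the whole DOM has been built.
    public static Task<XmlDocument> LoadInBackground(string xml)
    {
        return Task.Run(() =>
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(xml);
            return doc;
        });
    }

    public static void Main()
    {
        Task<XmlDocument> pending =
            LoadInBackground("<platforms><platform>.NET</platform></platforms>");

        Console.WriteLine("Loading document...");   // the caller is free to do other work here

        XmlDocument loaded = pending.Result;        // blocks only until the load completes
        Console.WriteLine("Document loaded: <{0}>", loaded.DocumentElement.Name);
    }
}
```

In a Windows Forms application you would typically await the task from an event handler instead of reading Result, so that the user interface thread is never blocked.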

Extracting XML DOM Subtrees

You normally build the XML DOM by loading the entire XML document into memory. However, the XmlDocument class also provides the means to extract only a portion of the document and return it as an XML DOM subtree. The key method to achieve this result is ReadNode, shown here:

public virtual XmlNode ReadNode(XmlReader reader);

The ReadNode method begins to read from the current position of the given reader and doesn’t stop until the end tag of the current node is reached. The reader is then left immediately after the end tag. For the method to work, the reader must be positioned on an element or an attribute node.

ReadNode returns an XmlNode object that contains the subtree representing everything that has been read, including attributes. ReadNode is different from ChildNodes in that it recursively processes children at any level and does not stop at the first level of siblings.
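
A minimal sketch of ReadNode follows. It positions a reader on the first element with a given name and asks the document to build a DOM subtree for just that element. The ReadNodeDemo class, the ReadFirst method, and the sample markup are assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ReadNodeDemo
{
    // Builds a DOM subtree for the first element named elementName and
    // ignores the rest of the input.
    public static XmlNode ReadFirst(string xml, string elementName)
    {
        XmlDocument doc = new XmlDocument();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.ReadToFollowing(elementName);   // position on the element
            return doc.ReadNode(reader);           // reads the element and all of its content
        }
    }

    public static void Main()
    {
        string xml =
            "<platforms>" +
            "<platform vendor=\"Sun\">Java</platform>" +
            "<platform vendor=\"Microsoft\">.NET</platform>" +
            "</platforms>";

        XmlNode subtree = ReadFirst(xml, "platform");
        Console.WriteLine("{0} ({1})", subtree.InnerText, subtree.Attributes["vendor"].Value);
    }
}
```

The returned node is created by the document but is not attached to its tree; call AppendChild on a parent node if you want to insert it.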

Visiting an XML DOM Subtree

So far, we’ve examined ways to get XML DOM objects out of an XML reader. Is it possible to call an XML reader to work on an XML DOM document and have the reader visit the whole subtree, one node after the next?

Chapter 2 introduced the XmlNodeReader class, with the promise to return to it later. Let’s do that now. The XmlNodeReader class is an XML reader that enables you to read nodes out of a given XML DOM subtree.

Just as XmlTextReader visits all the nodes of the specified XML file, XmlNodeReader visits all the nodes that form an XML DOM subtree. Note that the node reader is really capable of traversing all the nodes in the subtree no matter the level of depth. Let’s review a situation in which you might want to take advantage of XmlNodeReader.

The XmlNodeReader Class

Suppose you have selected a node about which you need more information. To scan all the nodes that form the subtree using XML DOM, your only option is to use a recursive algorithm like the one discussed with the LoopThroughChildren method in the section “Loading XML Documents.” The XmlNodeReader class gives you an effective, and ready-to-use, alternative, shown here:

// Select the root of the subtree to process
XmlNode n = root.SelectSingleNode("Employee[@id=2]");
if (n != null)
{
    // Instantiate a node reader object
    XmlNodeReader nodeReader = new XmlNodeReader(n);

    // Visit the subtree
    while (nodeReader.Read())
    {
        // Do something with the node...
        Console.WriteLine(nodeReader.Value);
    }
}

The while loop visits all the nodes belonging to the specified XML DOM subtree. The node reader class is initialized using the XmlNode object that is the root of the XML DOM subtree.
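
To see exactly which nodes the reader stops on, the following self-contained sketch records the type and name of every node visited. The NodeReaderDemo class, the Describe method, and the sample Employees markup are hypothetical and serve only this example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class NodeReaderDemo
{
    // Walks the subtree with an XmlNodeReader and returns one entry per
    // node visited, in document order.
    public static List<string> Describe(XmlNode subtreeRoot)
    {
        List<string> visited = new List<string>();
        using (XmlNodeReader reader = new XmlNodeReader(subtreeRoot))
        {
            while (reader.Read())
            {
                // Text nodes have an empty name, hence the Trim call
                visited.Add((reader.NodeType + " " + reader.Name).Trim());
            }
        }
        return visited;
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(
            "<Employees>" +
            "<Employee id=\"1\"><Name>Nancy</Name></Employee>" +
            "<Employee id=\"2\"><Name>Andrew</Name></Employee>" +
            "</Employees>");

        XmlNode n = doc.DocumentElement.SelectSingleNode("Employee[@id=2]");
        foreach (string entry in Describe(n))
            Console.WriteLine(entry);
    }
}
```

As with any XML reader, attributes are not returned by Read; use MoveToAttribute or GetAttribute while the reader is positioned on the owning element.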

Updating Text and Markup

Once an XML document is loaded in memory, you can enter all the needed changes by simply accessing the property of interest and modifying the underlying value. For example, to change the value of an attribute, you proceed as follows:

// Retrieve a particular node and update an attribute
XmlNode n = root.SelectSingleNode("days");
n.Attributes["module"].Value = "1";

To insert many nodes at the same time and in the same parent, you can exploit a little trick based on the concept of a document fragment. In essence, you concatenate all the necessary markup into a string and then create a document fragment, as shown here:

XmlDocumentFragment df = doc.CreateDocumentFragment();
df.InnerXml = "<extra>Value</extra><extra>Another Value</extra>";
parentNode.AppendChild(df);

Set the InnerXml property of the document fragment node with the string, and then add the newly created node to the parent. The nodes defined in the body of the fragment will be inserted one after the next.

In general, when you set the InnerXml property on an XmlNode-based class, any detected markup text will be parsed, and the new contents will replace the existing contents. For this reason, if you want simply to add new children to a node, pass through the XmlDocumentFragment class, as described in the previous paragraph, and avoid using InnerXml directly on the target node.
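
The difference between the two approaches is easy to verify. In the following sketch, one method appends the markup through a document fragment and the other assigns it to InnerXml on the target node; each returns the resulting number of children. The FragmentDemo class and its method names are ours, and the parent element with one existing child is an assumed starting point.

```csharp
using System;
using System.Xml;

public static class FragmentDemo
{
    const string Markup = "<extra>Value</extra><extra>Another Value</extra>";

    // Appends the new nodes after the existing children of the root.
    public static int AppendViaFragment(XmlDocument doc)
    {
        XmlDocumentFragment df = doc.CreateDocumentFragment();
        df.InnerXml = Markup;
        doc.DocumentElement.AppendChild(df);
        return doc.DocumentElement.ChildNodes.Count;
    }

    // Replaces the existing children of the root with the new nodes.
    public static int ReplaceViaInnerXml(XmlDocument doc)
    {
        doc.DocumentElement.InnerXml = Markup;
        return doc.DocumentElement.ChildNodes.Count;
    }

    public static void Main()
    {
        XmlDocument first = new XmlDocument();
        first.LoadXml("<parent><existing/></parent>");
        Console.WriteLine(AppendViaFragment(first));    // existing child is kept

        XmlDocument second = new XmlDocument();
        second.LoadXml("<parent><existing/></parent>");
        Console.WriteLine(ReplaceViaInnerXml(second));  // existing child is gone
    }
}
```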

Detecting Changes

Callers are notified of any changes that affect nodes through events. You can set event handlers at any time and even prior to loading the document, as shown here:

XmlDocument doc = new XmlDocument();
doc.NodeInserted += new XmlNodeChangedEventHandler(Changed);
doc.Load(fileName);

If you use the preceding code, you will get events for each insertion during the building of the XML DOM. The following code illustrates a minimal event handler:

void Changed(object sender, XmlNodeChangedEventArgs e)
{
    Console.WriteLine(e.Action.ToString()); 
}

Note that by design XML DOM events give you a chance to intervene before and after a node is added, removed, or updated.
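
The before-and-after pairing can be observed with a short sketch. It hooks the NodeInserting and NodeInserted events on an already loaded document, appends one element, and returns the sequence of notifications received. The names EventLogDemo and AppendAndLog are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class EventLogDemo
{
    // Appends a new element to the root and returns the notifications
    // raised by the document while doing so.
    public static List<string> AppendAndLog(XmlDocument doc, string elementName)
    {
        List<string> log = new List<string>();

        XmlNodeChangedEventHandler before =
            (sender, e) => log.Add("before " + e.Action + " " + e.Node.Name);
        XmlNodeChangedEventHandler after =
            (sender, e) => log.Add("after " + e.Action + " " + e.Node.Name);

        doc.NodeInserting += before;   // raised before the node is linked into the tree
        doc.NodeInserted += after;     // raised once the node is in place
        try
        {
            doc.DocumentElement.AppendChild(doc.CreateElement(elementName));
        }
        finally
        {
            doc.NodeInserting -= before;
            doc.NodeInserted -= after;
        }
        return log;
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<root/>");
        foreach (string entry in AppendAndLog(doc, "note"))
            Console.WriteLine(entry);
    }
}
```

The same pattern applies to the NodeRemoving/NodeRemoved and NodeChanging/NodeChanged pairs.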

Limitations of the XML DOM Eventing Model

Although you receive notifications before and after an action takes place, you can’t alter the predefined flow of operations. In other words, you can perform any action while handling the event, but you can’t cancel the ongoing operation. This also means that you can’t just skip some nodes based on run-time conditions. In fact, the event handler function is void, and all the arguments passed with the event data structure are read-only. Programmers have no way to pass information back to the reader and skip the current node. There is only one way in which the event handler can affect the behavior of the reader. If the event handler throws an exception, the reader will stop working. In this case, however, the XML DOM will not be built.
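
The exception route just described can be sketched as follows. This example applies it to an edit made after the document is loaded rather than to the loading phase itself, and it reflects our understanding that NodeInserting is raised before the node is linked into the tree; treat it as an illustration rather than a documented guarantee. The names VetoDemo, Forbid, and TryAppend are ours.

```csharp
using System;
using System.Xml;

public static class VetoDemo
{
    // Registers a handler that throws whenever an element with the given
    // name is about to be inserted anywhere in the document.
    public static void Forbid(XmlDocument doc, string forbiddenName)
    {
        doc.NodeInserting += delegate(object sender, XmlNodeChangedEventArgs e)
        {
            if (e.Node.Name == forbiddenName)
                throw new InvalidOperationException("Element not allowed: " + forbiddenName);
        };
    }

    // Tries to append a new element to the root; returns false if the
    // handler stopped the operation by throwing.
    public static bool TryAppend(XmlDocument doc, string elementName)
    {
        try
        {
            doc.DocumentElement.AppendChild(doc.CreateElement(elementName));
            return true;
        }
        catch (InvalidOperationException)
        {
            return false;
        }
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<root/>");
        Forbid(doc, "secret");

        Console.WriteLine(TryAppend(doc, "note"));
        Console.WriteLine(TryAppend(doc, "secret"));
        Console.WriteLine(doc.DocumentElement.ChildNodes.Count);
    }
}
```

Note that the handler has no gentler option: it cannot return a value or set a flag, so an exception is the only signal it can send.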

Selecting Nodes by Query

As mentioned, the XML DOM provides a few ways to traverse the document tree to locate a particular node. The ChildNodes property returns a linked list formed by the child nodes placed at the same level. You move back and forth in this list using the NextSibling and PreviousSibling properties.

You can also enumerate the contents of the ChildNodes list using a foreach-style enumerator. This enumerator is built into the XmlNodeList class and returned on demand by the GetEnumerator method, as shown here:

foreach(XmlNode n in node.ChildNodes)
{
    // Do something
}
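
For comparison, the same first-level children can be walked explicitly through the sibling links, in either direction. The following sketch assumes nothing beyond the standard XmlNode members; the SiblingDemo class and its method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class SiblingDemo
{
    // Visits the children from first to last using NextSibling.
    public static List<string> NamesForward(XmlNode parent)
    {
        List<string> names = new List<string>();
        for (XmlNode n = parent.FirstChild; n != null; n = n.NextSibling)
            names.Add(n.Name);
        return names;
    }

    // Visits the children from last to first using PreviousSibling.
    public static List<string> NamesBackward(XmlNode parent)
    {
        List<string> names = new List<string>();
        for (XmlNode n = parent.LastChild; n != null; n = n.PreviousSibling)
            names.Add(n.Name);
        return names;
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<r><a/><b/><c/></r>");
        Console.WriteLine(string.Join(",", NamesForward(doc.DocumentElement)));
        Console.WriteLine(string.Join(",", NamesBackward(doc.DocumentElement)));
    }
}
```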

Direct Access to Elements

The GetElementById method of the XmlDocument class returns the first element in the document that has an ID attribute with the specified value. Note that ID is a particular XML type and not simply an attribute with that name. An attribute can be declared as an ID only in an XML Schema Definition (XSD) or a DTD, and the .NET Framework implementation of GetElementById recognizes only the IDs declared in a DTD. The following DTD declaration defines an employeeid attribute of type ID. The attribute belongs to the Employee node.

<!ATTLIST Employee employeeid ID #REQUIRED>

A corresponding XML node might look like this:

<Employee employeeid="1" LastName="Davolio" FirstName="Nancy" />

As you can see, the source XML is apparently unaffected by the use of an ID attribute.

An ID attribute can be seen as an XML primary key, and the GetElementById method, which is part of the W3C DOM specification, is the search method that applications use to locate nodes by key. The following code retrieves the element in the document whose ID attribute (employeeid) matches the specified value:

XmlElement employeeNode = doc.GetElementById("1");

If the document contains no ID attributes or no matching value, the method returns null. The search for a matching node stops when the first match is found.
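
Here is a self-contained sketch of an ID lookup. It embeds the DTD in the document string so that the loader knows which attribute is of type ID. Two details are our own assumptions: the DemoDtd markup, including the LastName attribute declaration, and the ID values E1 and E2, chosen because the XML specification requires ID values to be valid XML names. The class name IdLookupDemo and the method FindLastName are hypothetical.

```csharp
using System;
using System.Xml;

public static class IdLookupDemo
{
    // The internal DTD subset declares employeeid as an attribute of type ID
    const string DemoDtd =
        "<!DOCTYPE Employees [" +
        "<!ELEMENT Employees (Employee*)>" +
        "<!ELEMENT Employee EMPTY>" +
        "<!ATTLIST Employee employeeid ID #REQUIRED LastName CDATA #IMPLIED>" +
        "]>";

    const string DemoData =
        "<Employees>" +
        "<Employee employeeid=\"E1\" LastName=\"Davolio\" />" +
        "<Employee employeeid=\"E2\" LastName=\"Fuller\" />" +
        "</Employees>";

    // Returns the LastName of the employee with the given ID, or null if
    // no element carries that ID.
    public static string FindLastName(string id)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(DemoDtd + DemoData);

        XmlElement employee = doc.GetElementById(id);
        return employee == null ? null : employee.GetAttribute("LastName");
    }

    public static void Main()
    {
        Console.WriteLine(FindLastName("E2"));
        Console.WriteLine(FindLastName("E9") == null);
    }
}
```

Without the DOCTYPE declaration, the same call would return null for every value, because the document would have no way of knowing that employeeid is an ID.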

Another query method at your disposal is GetElementsByTagName. As the name suggests, this method returns a list of nodes with the specified name. GetElementsByTagName looks similar to ChildNodes but differs in two respects. Whereas ChildNodes returns all the first-level child nodes found, including all elements and leaves, GetElementsByTagName returns only the element nodes with a particular name, and it looks for them among descendants at any depth. The name specified can be expressed as a local as well as a namespace-qualified name.
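
A quick sketch makes the contrast between the two members visible. The sample markup, with one item element nested inside a group element, and the names TagNameDemo, CountChildren, and CountByTagName are assumptions of ours.

```csharp
using System;
using System.Xml;

public static class TagNameDemo
{
    // Number of first-level children of the root, whatever their type.
    public static int CountChildren(XmlDocument doc)
    {
        return doc.DocumentElement.ChildNodes.Count;
    }

    // Number of elements with the given name, at any depth below the root.
    public static int CountByTagName(XmlDocument doc, string name)
    {
        return doc.DocumentElement.GetElementsByTagName(name).Count;
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<root><item/><group><item/></group>some text</root>");

        // ChildNodes sees <item>, <group>, and the text node
        Console.WriteLine(CountChildren(doc));

        // GetElementsByTagName sees both <item> elements, including the nested one
        Console.WriteLine(CountByTagName(doc, "item"));
    }
}
```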

XPath-Driven Access to Elements

The methods SelectNodes and SelectSingleNode provide more flexibility when it comes to selecting child nodes. Both methods support an XPath syntax (see Chapter 6) to select nodes along the XML subtree rooted in the current node. There are two main differences between these methods and the other methods we’ve examined, such as ReadNode and XmlNodeReader.

The first difference is that an XPath query lets you base the search at a deeper level than the current node. In other words, the query expression can select the level of child nodes on which the search will be based. By contrast, ChildNodes and the sibling properties work only on the first level of child nodes.

The second difference is that an XPath expression lets you select nodes based on logical criteria. The code in this section is based on the following XML layout:

<MyDataSet>
    <NorthwindEmployees>
        <Employee id="1" />
        ...
    </NorthwindEmployees>
</MyDataSet>

By default, the SelectNodes and SelectSingleNode methods work on the children of the node that calls them, as follows:

root.SelectNodes("NorthwindEmployees"); 
root.SelectNodes("NorthwindEmployees/Employee"); 
root.SelectNodes("NorthwindEmployees/Employee[@id>4]");

An XPath expression, however, can traverse the tree and move the context for the query one or more levels ahead, or even back. The first query selects all the NorthwindEmployees nodes found below the root (the MyDataSet node). The second query starts from the root but goes two levels deeper to select all the nodes named Employee below any NorthwindEmployees node. Finally, the third query adds a stricter condition and further narrows the result set by selecting only the Employee nodes whose id attribute is greater than 4. By using special syntax constructs, you can have XPath queries start from the root node or any other node ancestor, regardless of which node runs the query. (More on this topic in Chapter 6.)
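
The three queries can be tried out on a small in-memory document. The following sketch assumes a MyDataSet document with six Employee nodes numbered 1 through 6; the class name XPathDemo and the helper methods are ours. The last call in Main uses the // construct to show a query that starts from the document root even though it is issued from a deeper node.

```csharp
using System;
using System.Xml;

public static class XPathDemo
{
    // Builds the sample document: one NorthwindEmployees node with six employees.
    public static XmlDocument BuildSample()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(
            "<MyDataSet><NorthwindEmployees>" +
            "<Employee id=\"1\" /><Employee id=\"2\" /><Employee id=\"3\" />" +
            "<Employee id=\"4\" /><Employee id=\"5\" /><Employee id=\"6\" />" +
            "</NorthwindEmployees></MyDataSet>");
        return doc;
    }

    // Runs the query with the given node as context and counts the matches.
    public static int Count(XmlNode context, string xpath)
    {
        return context.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        XmlNode root = BuildSample().DocumentElement;

        Console.WriteLine(Count(root, "NorthwindEmployees"));
        Console.WriteLine(Count(root, "NorthwindEmployees/Employee"));
        Console.WriteLine(Count(root, "NorthwindEmployees/Employee[@id>4]"));

        // Issued from a single Employee node, yet evaluated from the document root
        XmlNode employee = root.SelectSingleNode("NorthwindEmployees/Employee[@id=2]");
        Console.WriteLine(Count(employee, "//Employee"));
    }
}
```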
