Tree manipulation

Most XML processors support the tree-walking approach, which gives an application 'random access' to a document by letting it walk the document structure. The XML processor first reads the entire document into memory, organizing the content so that it can be easily found and manipulated, and only then allows the application instant access to any part of it. There is no need for multi-pass parsing. Applications that benefit from this approach include XML-sensitive document editors and transformation tools (including XSLT engines).

As this is a more sophisticated technique than the event-driven approach described above, some concepts and associated terminology need to be understood first.

Nodes

As the XML processor reads the XML document, each component is stored as a separate object in memory. These objects are called nodes. Each node is an object of a specified type, such as an element node or a comment node. The XML processor gives the main application access to these nodes.

The concept of nodes appears in a number of standards, including the Grove (Graph Representation Of Property ValuEs) scheme devised for use with SGML. Some of the examples below are taken from this standard, though a newer, XML-based standard has also emerged and is discussed later.

Properties

Each node contains information about itself, in much the same way that elements contain attributes. Each kind of node has a pre-defined set of properties, and each property has a name and a value. For example, a node that describes a person may have a property called 'age' that holds a value representing the age of that individual.
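The idea of named properties can be sketched in a few lines of Python. This is only an illustration: the Node class and its get method are hypothetical names, not part of the Grove standard or of any XML processor.

```python
# Hypothetical sketch: a node as a collection of named properties.
class Node:
    def __init__(self, **properties):
        # Each property has a name (the key) and a value.
        self.properties = dict(properties)

    def get(self, name):
        """Return the value of the named property."""
        return self.properties[name]


# A node describing a person, with an 'age' property.
person = Node(age=32)
print(person.get("age"))
```

A dictionary is used here simply because it maps names to values; a real processor would normally fix the set of properties allowed for each kind of node.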



One property is significant to every node: the one that identifies what kind of node it is. In the Grove standard, this property is called 'type'. When the type property has a value of 'element', another property, called 'gi' (Generic Identifier), identifies which element the node represents. A node that represents a Paragraph element therefore has a type of 'element' and a gi that names the Paragraph element.
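As a rough sketch of how an application might use these two properties, the Python below stores a node as a plain dictionary. The property names 'type' and 'gi' come from the text; the gi value 'para' and the describe function are assumptions made for the example.

```python
# Hypothetical node for a Paragraph element; 'para' is an assumed element name.
paragraph = {"type": "element", "gi": "para"}
comment = {"type": "comment"}


def describe(node):
    """Report the kind of node, consulting 'gi' only for element nodes."""
    if node["type"] == "element":
        return "element <" + node["gi"] + ">"
    return node["type"]


print(describe(paragraph))
print(describe(comment))
```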



Connected nodes

A node may also contain properties that refer to other nodes, and each such connection between nodes is known as an arc. A single property may refer to one other node, or may contain a list of references to other nodes. For example, an element node may refer to several attribute nodes.
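A minimal Python sketch of arcs follows, assuming a hypothetical Node class whose properties may hold references to other nodes. The property names 'attributes' and 'owner', and the attribute names and values, are invented for the illustration.

```python
# Hypothetical sketch: properties whose values are references to other nodes.
class Node:
    def __init__(self, **properties):
        self.properties = dict(properties)


align = Node(type="attribute", name="align", value="left")
ident = Node(type="attribute", name="id", value="p1")

# One property holding a list of references: an arc to each attribute node.
para = Node(type="element", gi="para", attributes=[align, ident])

# One property holding a single reference (assumed name 'owner').
align.properties["owner"] = para

names = [attr.properties["name"] for attr in para.properties["attributes"]]
print(names)
```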



In the Grove scheme, each arc is labelled with the name of the property that holds the reference, and has one of three possible relationship types: 'subnode' (child node), 'irefnode' (internal reference node) or 'urefnode' (unrestricted/external reference node).
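The labelling of arcs might be modelled as below. Only the three relationship names are taken from the text; the Arc class, the tree_arcs helper and the labels 'content' and 'referent' are hypothetical, and the targets are plain strings standing in for nodes.

```python
from enum import Enum


class Relationship(Enum):
    SUBNODE = "subnode"      # child node
    IREFNODE = "irefnode"    # internal reference node
    UREFNODE = "urefnode"    # unrestricted/external reference node


class Arc:
    def __init__(self, label, relationship, target):
        self.label = label                # name of the property holding the reference
        self.relationship = relationship  # one of the three Relationship values
        self.target = target              # the node referred to


def tree_arcs(arcs):
    """Keep only the parent-to-child arcs."""
    return [arc for arc in arcs if arc.relationship is Relationship.SUBNODE]


arcs = [
    Arc("content", Relationship.SUBNODE, "title node"),
    Arc("referent", Relationship.IREFNODE, "target node"),
]
print([arc.label for arc in tree_arcs(arcs)])
```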

Hierarchical structures

Arcs may be used to organize nodes into tree structures that are suitable for describing the hierarchical nature of XML documents. This technique allows an application to 'drill down' into a document to find a required element. It also becomes possible to delete document fragments (branches), or move or copy them to other locations.
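The operations just described (drilling down, deleting a branch, moving a branch) can be sketched with a small tree class. Everything here is an assumption made for illustration: the Node class, its method names and the element names are not drawn from any standard.

```python
# Hypothetical tree node with parent and child arcs.
class Node:
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def append(self, child):
        """Attach a branch here, detaching it from any previous parent (a move)."""
        if child.parent is not None:
            child.parent.children.remove(child)
        child.parent = self
        self.children.append(child)

    def remove(self, child):
        """Delete a branch: the child and all of its descendants."""
        self.children.remove(child)
        child.parent = None

    def find(self, name):
        """Drill down, depth first, to the first descendant with this name."""
        for child in self.children:
            if child.name == name:
                return child
            found = child.find(name)
            if found is not None:
                return found
        return None


book = Node("book")
chapter1, chapter2, note = Node("chapter1"), Node("chapter2"), Node("note")
book.append(chapter1)
book.append(chapter2)
chapter1.append(note)

chapter2.append(book.find("note"))   # move the note to the second chapter
book.remove(chapter1)                # delete the first chapter
print([child.name for child in book.children])
```

Copying a branch would additionally require duplicating every node in it, which is omitted here.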

It is still possible to extract the content of an element (or of the entire document, when starting at the root element) in sequential order, though the technique is more complex than for simple event-driven processing. Recursion is needed to traverse the nodes that represent all the descendants.
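A recursive traversal of this kind might look like the following Python sketch. The Element class, the text_of function and the sample element names are hypothetical; text is represented simply as strings among the children.

```python
# Hypothetical element node whose children are other elements or text strings.
class Element:
    def __init__(self, gi, children=()):
        self.gi = gi
        self.children = list(children)


def text_of(node):
    """Collect text in document order by recursing through all descendants."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node.children)


para = Element("para", ["A ", Element("emph", ["very"]), " short paragraph."])
print(text_of(para))  # prints "A very short paragraph."
```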



Non-hierarchical structures

In the Grove standard, arcs between nodes do not have to follow a tree-like structure. A node may refer to any other node, regardless of its position in the tree, provided that the reference is of the irefnode type. As groves may be multi-directional, even cyclic, they can also represent a DTD structure that includes nested element models.

For example, a node that represents a List element declaration points to other nodes describing the elements that the list may contain. If a list may contain only Item elements, there will be an arc from the List node to the Item node. If an Item element may in turn contain a complete list, the Item node will also contain an arc back to the List node (as well as arcs to any other elements it may contain).
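The List and Item cycle can be sketched as follows. The Declaration class, the may_contain property and the reachable function are assumptions for the example; the point is that any code walking a cyclic structure must remember which nodes it has already visited, or it will loop forever.

```python
# Hypothetical node for an element declaration.
class Declaration:
    def __init__(self, name):
        self.name = name
        self.may_contain = []   # arcs to declarations of permitted child elements


list_decl = Declaration("List")
item_decl = Declaration("Item")
list_decl.may_contain.append(item_decl)   # a List may contain Items
item_decl.may_contain.append(list_decl)   # an Item may contain a List (cycle)


def reachable(start):
    """Names of all declarations reachable from start, visiting each only once."""
    seen = []
    stack = [start]
    while stack:
        decl = stack.pop()
        if decl in seen:
            continue            # already visited: do not follow the cycle again
        seen.append(decl)
        stack.extend(decl.may_contain)
    return [decl.name for decl in seen]


print(reachable(list_decl))
```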



Document Object Model (DOM 1.0)

For exactly the same reasons that a standard API is desirable for event-driven processing, a standard is equally desirable for tree-walking access to a parser.

The DOM (Document Object Model) standard has been developed for this purpose. It defines an API for object-oriented languages such as Java, covers HTML as well as XML documents, and is free for commercial use (see Chapter 20 and Chapter 21).
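To give a flavour of DOM-style access, the sketch below uses xml.dom.minidom, a DOM implementation in Python's standard library (chosen here only for brevity; the text mentions Java). The sample document is invented for the example.

```python
from xml.dom import minidom

# Parse the whole document into an in-memory tree of DOM nodes.
doc = minidom.parseString("<list><item>one</item><item>two</item></list>")
root = doc.documentElement

# Random access: fetch every 'item' element beneath the root.
items = root.getElementsByTagName("item")
texts = [item.firstChild.data for item in items]
print(texts)

# Move a branch: appending a node that already has a parent relocates it.
root.appendChild(items[0])
print(root.toxml())
```

Because the first item is re-appended at the end of its parent, the serialized result lists the items in the opposite order.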
