Canonical XML

The term canonicalization was borrowed loosely from its older context to describe reducing an instance document to the master, or commonly accepted, form of that document. Canonicalization is sometimes abbreviated as C14N, much as internationalization is commonly abbreviated as I18N.

Canonical XML is a W3C recommendation that lets you determine whether two different physical representations of a document are logically, or “canonically,” equivalent. In this section, we explore some of the technical features of Canonical XML so that you can judge how it applies to your needs.

The Canonical XML Data Model

To begin converting a document to canonical form, you, or rather your Canonical XML processor, must start with some form of XML that it can understand. The first parameter to a canonical translator is therefore either an XPath node set or a serialized XML document. The second parameter is a Boolean value that indicates whether comments should be included in the canonical output.
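As a rough illustration of these two parameters, the sketch below uses `xml.etree.ElementTree.canonicalize()` from the Python standard library (Python 3.8 or later). This is an assumption on our part rather than the tool discussed in this section: that function implements the later C14N 2.0 revision and accepts only a serialized document, not an XPath node set. Its `with_comments` keyword plays the role of the Boolean comment flag.

```python
import xml.etree.ElementTree as ET

# First parameter: a serialized XML document (a node set is not supported here).
source = "<doc><!-- note -->text</doc>"

# Second parameter: a Boolean controlling whether comments are kept.
without_comments = ET.canonicalize(source, with_comments=False)
with_comments = ET.canonicalize(source, with_comments=True)

print(without_comments)
print(with_comments)
```

With the flag off, the comment is dropped from the canonical form; with it on, the comment is carried through.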

In the case of a node set, line feeds must already be normalized, attribute values normalized, CDATA sections replaced with their character content, and character and parsed entity references resolved. In other words, each node must be fully cooked: no stranded entities may remain. All whitespace within the root element must be preserved, with the exception of line-delimiter normalization. The whole approach leads you to think that the document is being worked over (flattened, stretched, and pulled like pizza dough) just prior to being cooked.
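Most of this "cooking" is ordinary XML parsing. The sketch below is an illustration using the standard library's ElementTree parser, which is not a Canonical XML processor. The sample document and its contents are made up for the example. It shows the parsed data after line endings have been normalized, the CDATA section has been replaced by its characters, and the references have been resolved.

```python
import xml.etree.ElementTree as ET

# A raw document with a CRLF line ending, a CDATA section, a character
# reference, an entity reference, and a literal newline in an attribute value.
raw = (
    '<doc note="line1\nline2">first\r\nsecond '
    '<![CDATA[<b> & more]]> &#169; &amp;</doc>'
)

root = ET.fromstring(raw)

# The attribute value has been normalized by the parser.
print(repr(root.get("note")))

# The text is plain characters: no CDATA markers, no unresolved references.
print(repr(root.text))
```

Whitespace inside the root element survives parsing; only the line delimiter is changed.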

Document Order

Canonical XML builds on the XPath data model, but it imposes a few additional ordering rules on the node sets that are passed to any Canonical XML processor.

  1. An element’s namespace and attribute nodes must follow the element but precede any children.

  2. Namespace nodes must precede attribute nodes.

  3. Namespace nodes for an element are sorted lexicographically by local name (the prefix). The default namespace node has no local name, so it sorts first.

  4. Attribute nodes for an element are sorted lexicographically with the namespace URI as a primary key and the local name as a secondary key.
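The sorting rules above can be sketched in a few lines of Python. The function names and the tuple and dictionary layouts are hypothetical, chosen only for illustration; a real processor would work on its own node objects. Python compares strings by code point, which matches the lexicographic ordering the rules call for.

```python
def sort_namespace_nodes(namespaces):
    """Sort namespace nodes by local name (the prefix).

    `namespaces` maps prefix -> namespace URI; the default namespace
    uses the empty string as its prefix, so it sorts first.
    """
    return sorted(namespaces.items(), key=lambda item: item[0])


def sort_attribute_nodes(attributes):
    """Sort attribute nodes by namespace URI, then by local name.

    `attributes` is a list of (namespace_uri, local_name, value) tuples;
    unqualified attributes carry an empty namespace URI and sort first.
    """
    return sorted(attributes, key=lambda attr: (attr[0], attr[1]))


namespaces = {"b": "http://example.com/b", "": "http://example.com/default",
              "a": "http://example.com/a"}
attributes = [
    ("http://example.com/b", "x", "1"),
    ("", "z", "2"),
    ("http://example.com/a", "y", "3"),
    ("", "a", "4"),
]

# Namespace nodes come before attribute nodes for the same element.
ordered = sort_namespace_nodes(namespaces) + sort_attribute_nodes(attributes)
for node in ordered:
    print(node)
```

Note that the attribute sort key is the namespace URI, not the prefix, so renaming a prefix does not change attribute order.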

Canonical XML Structure

Canonical XML does away with the XML declaration and DTD, and also normalizes whitespace outside of the root element. Abbreviated empty elements (in the style of <element-name/>) are converted to start- and end-tag pairs (<element-name></element-name>). Namespace declarations and attributes may be lexicographically reordered to comply with canonical expectations as described in Section 2.7.2. In addition to these modifications, a canonical representation replaces CDATA sections with their actual characters and replaces character references with the characters they stand for. Special characters in attribute values and text are, in turn, written out as character references.
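To see these transformations together, the following sketch canonicalizes two physically different serializations of the same made-up document and compares the results. It again assumes `xml.etree.ElementTree.canonicalize()` (Python 3.8+), which implements C14N 2.0 rather than the original recommendation. For a simple document without namespaces, like this one, the structural changes described above apply in the same way.

```python
import xml.etree.ElementTree as ET

# Version A: XML declaration, one attribute order, CDATA, abbreviated empty tag.
doc_a = (
    '<?xml version="1.0"?>\n'
    '<order id="7" currency="USD">'
    '<note><![CDATA[a < b]]></note><shipped/></order>\n'
)

# Version B: no declaration, single quotes, different attribute order,
# escaped text instead of CDATA, explicit start- and end-tag pair.
doc_b = (
    "<order currency='USD' id='7'>"
    "<note>a &lt; b</note><shipped></shipped></order>"
)

canonical_a = ET.canonicalize(doc_a)
canonical_b = ET.canonicalize(doc_b)

print(canonical_a)
print(canonical_a == canonical_b)
```

Comparing the canonical strings, rather than the raw text, is what lets the two serializations be treated as the same document.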

Canonical XML is quite new, and we have yet to see significant amounts of Python software developed for Canonical XML processing. The vision of Canonical XML is blurry, but it is a method for checking two instances (regardless of DTD or Schema) and working them over like cleaned fish to see if they share the same skeletons. Version 0.7 of PyXML will include support for rendering XML in canonical form.
