Working with URIs

Uniform Resource Identifiers (URIs) are used to uniquely identify resources, and they may be absolute or relative. Absolute URIs provide the entire context for identifying the resources, such as http://datypic.com/prod.html. Relative URI references are specified as the difference from a base URI, such as ../prod.html. A URI reference may also contain a fragment identifier following the # character, such as ../prod.html#shirt.

The three previous examples happen to be HTTP Uniform Resource Locators (URLs), but URIs also encompass URLs of other schemes (e.g., FTP, gopher, telnet), as well as Uniform Resource Names (URNs). URIs are not required to be dereferenceable; that is, it is not necessary for there to be a web page or other resource at http://datypic.com/prod.html in order for this to be a valid URI. Sometimes URIs just serve as names. For example, in XQuery, URIs are used as the names of namespaces and collations.

The built-in type xs:anyURI represents a URI reference. Most XQuery functions that accept URIs as arguments call for xs:string values instead, but an xs:anyURI value is acceptable also. This is because of a special type-promotion rule that allows xs:anyURI values to be automatically promoted to xs:string when a string is expected. Most of the URI-related functions return xs:anyURI values, following the philosophy of being liberal in what they accept and specific in what they produce.
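The promotion rule can be seen in a small query. The following is a minimal sketch rather than an example from the original text; it reuses the URIs shown earlier and passes an xs:anyURI value to a function that is declared to take a string:

```xquery
(: An xs:anyURI value passed where a string is expected :)
let $uri := xs:anyURI("../prod.html#shirt")
return (
  (: substring-after expects strings; $uri is promoted to xs:string :)
  substring-after($uri, "#"),                  (: "shirt" :)
  (: resolve-uri accepts strings but returns an xs:anyURI value :)
  resolve-uri("prod.html", "http://datypic.com/order.html")
    instance of xs:anyURI                      (: true :)
)
```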

Base and Relative URIs

Relative URIs are interpreted relative to an absolute URI, known as a base URI. For example, the relative URI prod.html is useless unless interpreted in the context of an absolute URI. In HTML documents, the base URI is often the URI of the document itself. If an HTML document is located at http://datypic.com/order.html, and it contains a link to prod.html, that relative URI is resolved in the context of http://datypic.com/order.html, and the link points to http://datypic.com/prod.html.

Using the xml:base attribute

In XML documents, you can also explicitly specify a base URI using the xml:base attribute. The scope of each xml:base attribute is the element on which it appears and all its content.

Example 20-3 shows an XML document that uses the xml:base attribute on the catalog elements, with relative URI references (the href attributes) for each product. The href="prod443.html" attribute of the first product element, for example, is resolved relative to the xml:base attribute of the first catalog element, namely http://example.org/ACC/.

Example 20-3. Document using xml:base (http://datypic.com/cats.xml)

<catalogs>
  <catalog name="ACC" xml:base="http://example.org/ACC/">
    <product number="443" href="prod443.html"/>
    <product number="563" href="prod563.html"/>
  </catalog>
  <catalog name="WMN" xml:base="http://example.org/WMN/">
    <product number="557" href="prod557.html"/>
  </catalog>
</catalogs>

Finding the base URI of a node

The base-uri function can be used to retrieve the base URI of a node. For document nodes, the base URI is the URI from which the document was retrieved. For example:

base-uri(doc("http://datypic.com/cats.xml"))

returns http://datypic.com/cats.xml.

For an element node, the base URI is the value of its own xml:base attribute, if it has one, or otherwise the xml:base value of its nearest ancestor that has one. For example, if $prod is bound to the first product element in cats.xml, the function call:

base-uri($prod)

returns http://example.org/ACC/, because that is the xml:base value of its nearest ancestor.

If neither the element nor any of its ancestors has an xml:base attribute, the element's base URI defaults to the base URI of the document node, if one exists.
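To see how the xml:base scoping plays out across a whole document, the following sketch lists the base URI of every product in Example 20-3. It assumes the processor can retrieve the document at http://datypic.com/cats.xml; the expected values in the comment follow from the xml:base attributes in that example.

```xquery
(: List each product number with the base URI it inherits :)
for $prod in doc("http://datypic.com/cats.xml")//product
return concat($prod/@number, ": ", base-uri($prod))

(: Expected results for Example 20-3:
   443: http://example.org/ACC/
   563: http://example.org/ACC/
   557: http://example.org/WMN/ :)
```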

Resolving URIs

The resolve-uri function takes a relative URI and a base URI as arguments, and constructs an absolute URI. For example, the function call:

resolve-uri("prod.html", "http://datypic.com/order.html")

returns http://datypic.com/prod.html.
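A few more calls illustrate how different kinds of relative references are resolved. This is a sketch under the standard URI resolution rules; the orders/2024 path and the example.org URI are hypothetical and used only for illustration.

```xquery
(
  (: A parent-directory step is removed during resolution :)
  resolve-uri("../prod.html", "http://datypic.com/orders/2024/order.html"),
    (: http://datypic.com/orders/prod.html :)

  (: A fragment identifier is carried through unchanged :)
  resolve-uri("prod.html#shirt", "http://datypic.com/order.html"),
    (: http://datypic.com/prod.html#shirt :)

  (: An absolute URI is returned as is; the base URI is not applied :)
  resolve-uri("http://example.org/a.html", "http://datypic.com/order.html")
    (: http://example.org/a.html :)
)
```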

The base URI of the static context

The base URI of an individual node is set by the xml:base attribute or by the document URI. There is also a separate base URI, known as the base URI of the static context. The base URI of the static context is used in several cases:

  • When an element is constructed in a query, its base URI is set to the base URI of the static context, if one is defined. Otherwise, its base URI is the empty sequence.

  • When relative URI references are used as arguments to the doc and collection functions, or to functions that accept collations as arguments, they are resolved relative to the base URI of the static context.

  • When a base URI argument is not provided to the resolve-uri function, it resolves the URI relative to the base URI of the static context.

The base URI of the static context can be set in the query prolog, using a base URI declaration. Its syntax is shown in Figure 20-1.

Figure 20-1. Syntax of a base URI declaration

Here's an example of a base URI declaration:

declare base-uri "http://datypic.com";

The base URI must be a literal value in quotes (not an evaluated expression), and it should be a syntactically valid absolute URI.

It is also possible for the processor to set the base URI of the static context outside the scope of the query. Although it is implementation-defined, it's reasonable to expect that if the query itself is read from a file, the base URI of the static context will default to the location of that file. The value of the base URI of the static context can be retrieved using the static-base-uri function.
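The three uses of the base URI of the static context can be combined in one short query. The sketch below assumes a declaration with a trailing slash (a slight variation on the declaration shown above, chosen so that relative references resolve predictably across processors).

```xquery
declare base-uri "http://datypic.com/";

(
  (: The declared value is available through static-base-uri :)
  static-base-uri(),                  (: http://datypic.com/ :)

  (: With no second argument, resolve-uri uses the static base URI :)
  resolve-uri("prod.html"),           (: http://datypic.com/prod.html :)

  (: A constructed element takes its base URI from the static context :)
  base-uri(<product number="443"/>)   (: http://datypic.com/ :)
)
```

If the declaration is omitted, the results depend on whatever base URI, if any, the processor supplies.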

Documents and URIs

When accessing an input document using the doc function, a URI is used to specify the document of interest. Processors interpret the URI passed to the doc function in different ways. Some, like Saxon, will dereference the URI, that is, go out to the URL and retrieve the resource at that location. Other implementations, such as those embedded in XML databases, consider the URIs to be just names. The processor might take the name and look it up in an internal catalog to find the document associated with that name.

Finding the URI of a document

You can find the absolute URI from which a document node was retrieved using the document-uri function. This function is basically the inverse of the doc function. Where the doc function accepts a URI and returns a document node, the document-uri function accepts a document node and returns a URI.

For example, if the variable $orderDoc is bound to the result of doc("http://datypic.com/order.xml"), then document-uri($orderDoc) returns http://datypic.com/order.xml.

In most cases, this has the same effect as calling the base-uri function on the document node.
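The relationship between the two functions can be sketched as follows. This assumes a processor that retrieves http://datypic.com/order.xml by dereferencing the URI; a processor that treats URIs as catalog names may report different values.

```xquery
let $orderDoc := doc("http://datypic.com/order.xml")
return (
  document-uri($orderDoc),     (: http://datypic.com/order.xml :)
  base-uri($orderDoc),         (: normally the same value :)
  (: document-uri returns the empty sequence for non-document nodes :)
  document-uri($orderDoc/*)
)
```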

Opening a document from a dynamic value

Most of the examples of the doc function in this book use a hardcoded URI, as in doc("order.xml"). However, suppose you wanted to open the documents referenced in Example 20-3. For example, you want to open the product information page for product number 443. Its relative URI is prod443.html, and its base URI is http://example.org/ACC/. To do this, you could use:

let $prod := doc("cats.xml")/catalogs/catalog[1]/product[1]/@href
let $absoluteURI := resolve-uri($prod, base-uri($prod))
return doc($absoluteURI)

which would open the document at http://example.org/ACC/prod443.html.
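The same approach extends to every product in the catalog. The following sketch assumes that cats.xml can be found relative to the base URI of the static context, and it uses the doc-available function to skip any product page that cannot be retrieved instead of raising an error.

```xquery
(: Open the page for each product, resolving each href
   against the base URI in scope for that attribute :)
for $href in doc("cats.xml")//product/@href
let $absoluteURI := resolve-uri($href, base-uri($href))
return
  if (doc-available($absoluteURI))
  then doc($absoluteURI)
  else ()   (: skip pages that cannot be retrieved :)
```

Because each href is resolved against its own base URI, the page for product 557 is looked up under http://example.org/WMN/ rather than http://example.org/ACC/.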

Escaping URIs

URIs require that some characters be escaped: each byte of the character's UTF-8 encoding is written as two hexadecimal digits preceded by the % character. This applies to non-ASCII characters and some ASCII characters, namely control characters, spaces, and several others. In addition, certain characters in URIs are separators that are intended to delimit parts of URIs, namely the characters ; , / ? : @ & = + $ [ ] and %. If one of these delimiter characters is used in a URI with a meaning other than as a delimiter, it too must be escaped.

Three functions are available for escaping URI values: iri-to-uri, escape-html-uri, and encode-for-uri. All three replace each special character with an escape sequence in the form %xx (repeated for characters that take more than one byte), where xx is two uppercase hexadecimal digits that represent one byte of the character in UTF-8. For example, all three escape the é in ../édition.html as %C3%A9, so that iri-to-uri returns ../%C3%A9dition.html.

They vary in which characters they escape:

iri-to-uri

Escapes only those characters that are not allowed in URIs. It leaves the delimiters ; , / ? : @ & = + $ [ ] and % unescaped, so it is appropriate for escaping entire URIs.

escape-html-uri

Escapes characters as required by HTML agents. Specifically, it escapes everything except ASCII characters 32 to 126. It is appropriate for URIs that are to be handled by browsers.

encode-for-uri

Is the most aggressive of the three. It escapes all the characters that are required to be escaped in URIs, plus all the delimiter characters. It is appropriate for escaping pieces of URIs, such as filenames, that cannot contain delimiter characters.

Note that none of these functions check whether the argument provided is a valid URI; they simply act on the argument as if it were any string.
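The differences are easiest to see when all three functions are applied to the same string. The sketch below uses hypothetical filenames chosen to contain a non-ASCII character, a space, and a delimiter; the expected results in the comments follow from the rules described above.

```xquery
let $name := "édition 1/2.html"
return (
  iri-to-uri($name),        (: %C3%A9dition%201/2.html :)
  escape-html-uri($name),   (: %C3%A9dition 1/2.html :)
  encode-for-uri($name),    (: %C3%A9dition%201%2F2.html :)

  (: Typical use of encode-for-uri: escape one path segment,
     then append it to the rest of the URI :)
  concat("http://datypic.com/", encode-for-uri("shirts &amp; hats.html"))
    (: http://datypic.com/shirts%20%26%20hats.html :)
)
```

Only encode-for-uri escapes the / and & characters, which is why it should be applied to individual pieces of a URI and not to a complete one.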
