Chapter 14. Beyond the Standard

Introduction

This chapter discusses features that didn't make it into XQuery 1.0 but may appear in future versions. I begin by enumerating features that are likely (in my informed opinion) to change between now and the final XQuery 1.0 standard. Next, I describe how all the XQuery standards documents interrelate with one another and the rest of XML. I then speculate about some of the features that may appear in XQuery 1.1 or 2.0, and highlight two of these features: updating XML data and full-text search. Finally, this chapter ends with some of the performance benchmarks that are already beginning to spring up around XQuery.

Potential Changes

There's an old adage that people who love sausage, respect the law, and work with standards shouldn't watch any of them being made. XQuery has changed a lot since the Working Group committee published its first document in February 2001, from minor syntax changes, such as the keywords becoming lowercase instead of uppercase and the sort operator being replaced by an order by clause in the FLWOR expression (which originally was FLWR), to major semantics changes, such as the ever-evolving type system and navigation operators.

At the time of this writing, XQuery has reached the Last Call stage, the first point at which it has been deemed stable enough to solicit widespread public feedback. Several hundred issues have been raised publicly by vendors and other W3C standards groups, and will be addressed before XQuery becomes a final Recommendation. Many of these issues are minor and will not affect you in any major way, or will not lead to any change at all in the standard (except possibly the wording of it). However, some of the suggested changes would have a major impact on the queries you write. Only time will tell what these changes will be, but here are some of the more likely candidates.

Namespaces

The built-in namespaces and collation are all versioned at the time of this writing, for example, http://www.w3.org/2003/11/xpath-functions. The year and month in the URL correspond to the latest public draft. These will definitely change in the final XQuery standard, but their eventual values are unknown at this time.

Modules and Prolog

Modules are a relatively new feature in XQuery, and many issues are still being hashed out. Their interaction with the prolog, and the desire to design XQuery to allow for future changes, have led to many changes to both in recent drafts. I expect both to continue to fluctuate between now and the final version. For example, recent drafts introduced semicolons after every prolog statement, changed the keywords used, and changed how global definitions in modules interact with each other and the main query.

Additional Types

There has been considerable pushback against the introduction of types not in XML Schema 1.0, namely xdt:dayTimeDuration, xdt:yearMonthDuration, xdt:untypedAtomic, and xdt:anyAtomicType.

The duration subtypes were introduced by the committee because the xs:duration type isn't totally ordered and presents barriers to using date/time arithmetic. These two new types and some twenty built-in functions for working with them were introduced to overcome these difficulties. It has been proposed that instead these types and their attending functions could be replaced by a couple of simple functions that convert xs:duration to and from a numeric type, like xs:decimal, which is then totally ordered and can be manipulated using all the usual arithmetic operators. Users would then have to do a little more work to operate on durations (but date/time programming always requires custom logic). Although the committee may revisit this issue, for now they have decided to keep the duration subtypes. Whether all the functions for working with them will also be kept remains to be seen.
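To make the ordering problem concrete: one month and thirty days cannot be compared without knowing which month is meant, so xs:duration values have no total order. Each subtype restricts itself to one family of units, which makes comparison and date arithmetic well defined. The sketch below uses the draft-era xdt prefix from this chapter and assumes the processor binds it; a processor that follows a later specification may expect a different prefix.

```xquery
(: Day/time durations reduce to seconds, so they can be ordered: true. :)
xdt:dayTimeDuration("P1D") lt xdt:dayTimeDuration("PT36H"),

(: Year/month durations reduce to months, so they can be ordered too: true. :)
xdt:yearMonthDuration("P1Y") gt xdt:yearMonthDuration("P11M"),

(: Adding a day/time duration to a date yields another date. :)
xs:date("2004-02-28") + xdt:dayTimeDuration("P2D")
```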

The other two types are more difficult to remove. XML Schema 1.0 has two base types, xs:anyType and xs:anySimpleType. The former includes all schema types (even element and attribute node structures), and the latter includes all simple types (atomic types as well as union and list types like xs:IDREFS). XQuery needs a base type that covers atomic types only and excludes list and union types, hence xdt:anyAtomicType.

The xdt:untypedAtomic type exists to distinguish typed data from untyped data. Expressions that involve xdt:untypedAtomic values are defined to use essentially XPath 1.0 rules, while operators on typed values (or a mixture of typed and untyped) use new rules that make sense for XQuery and XML Schema. I expect that the type conversions of language expressions that use this type will undergo additional changes before XQuery is finalized. (They've changed in every draft so far!) However, it's unlikely that this type will be cut or otherwise significantly altered.

Simplify, Simplify

Many people believe that software starts out simple and becomes complex over time. In my experience, the exact opposite happens: Anyone can create a complex design, but considerable conscious effort and time are required to simplify one. In any case, the committee is actively simplifying parts of XQuery. I can safely predict that XQuery is a little more complex today than it will be when it is finalized.

Some complexity shows up as redundancy. For example, concat() is redundant with string-join(), and cast as is redundant with type constructors. However, neither of these is likely to be cut.
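A small illustration of both redundancies, written as a single sequence expression. The literal values are arbitrary; within each pair, the two expressions should produce the same value.

```xquery
(: Two ways to join strings. :)
concat("X", "Query"),
string-join(("X", "Query"), ""),

(: Two ways to convert a string to an integer. :)
"42" cast as xs:integer,
xs:integer("42")
```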

Some complexity shows up as “missing” functionality. For example, there is no group by operator and no apply() function (that would apply an operation to every member of a sequence), so users must write lengthier queries to accomplish the same tasks. At this point, the committee is unlikely to add any major new functionality to XQuery.
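As a rough sketch of the lengthier style this forces, grouping is commonly emulated with distinct-values() and a second pass over the data. The query below assumes a team.xml document containing Employee elements with Expertise children; the Group element and its attributes are hypothetical names chosen for illustration.

```xquery
(: Assumed input: team.xml holds Employee elements, each with Expertise children. :)
let $employees := doc("team.xml")//Employee
(: First pass: find each distinct grouping key. :)
for $topic in distinct-values($employees/Expertise)
order by $topic
return
  (: Second pass: revisit the employees to collect each group's members. :)
  <Group expertise="{$topic}"
         count="{count($employees[Expertise = $topic])}"/>
```

Note that the employees are scanned once per distinct key, which is part of the cost of not having a dedicated grouping operator.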

Some complexity shows up as irregularities. For example, some type conversions treat the xs:integer type (which is derived from xs:decimal) differently from other derived types. Some prolog statements use an equal sign while others don't. Some string functions take an optional collation argument, and some don't. Although the committee continues to smooth out the rough spots, some irregularities will inevitably remain in the final draft, if for no other reason than that XQuery has been designed by many different people with different styles and goals.

Some complexity shows up as unnecessary case analysis (something the late Edsger Dijkstra loathed). The type conversion rules are an example of this; figuring out what happens in a query like the one in Example 14.1 (or even whether it's legal at all) is too challenging in the current XQuery draft. I think the definitions of the type conversion rules will be greatly simplified for implementers, but probably with minimal changes to the final result seen by users.

Example 14.1. Tricky type conversions

declare function foo($n as xs:decimal) as xs:integer* {
  $n
};
1 + foo(2 + 3.0) + foo(4)

Built-in Functions

There is a general sense that the built-in function library is too large. Eliminating the duration subtypes would also eliminate about 20% of the existing library, and other functions might be added, removed, or given different signatures.

Standards Roadmap

Befitting its central nature, XQuery is connected to almost every other W3C XML standard. It also consists of several documents itself. This section provides a rough overview of how they all interconnect. You may also find the diagram at http://kensall.com/big-picture/ enlightening (or at least entertaining).

Today

XQuery 1.0 is defined by more than ten different documents, spanning many thousands of pages. (See the Bibliography for complete references.) XQuery was set in motion by the XML Query Requirements document and the XML Query Use Cases document. The former describes the committee's objectives in creating XQuery, and the latter provides concrete examples that should be supported by XQuery. Together these two documents provide the framework within which XQuery was created.

The main document that ties them all together is XQuery 1.0: An XML Query Language. This document describes the aspects of the language that concern most users. This core language document is supported by several other documents that provide a formal theoretical basis for XQuery. Some of these documents are also shared with the XPath 2.0 and XSLT 2.0 standards.

For example, the XQuery 1.0 and XPath 2.0 Functions and Operators defines a vast library of built-in functions (as well as pseudo-functions that define operator behavior) shared between XQuery 1.0 and XPath 2.0.

As another example, the XQuery 1.0 and XPath 2.0 Data Model document defines the formal data model for XQuery. It is intricately connected with four older XML specifications: the XML 1.0 Recommendation, the Namespaces in XML Recommendation, the XML Information Set (Infoset) Recommendation, and the XML Schema 1.0 Recommendation.

XQuery and XPath also share the XQuery 1.0 and XPath 2.0 Formal Semantics document, which lays out a set of algebraic formalisms that govern the behavior of queries. This document is really only of interest to implementers and academics; end users don't need to know anything about it.

Additional XQuery specifications include the XQueryX document and the XQuery Serialization document. These two documents define an XML serialization format for XQuery queries and data models, respectively.

The XPath 1.0 and XSLT 1.0 Recommendations define query languages for XML that have some similarities to XQuery; XQuery draws on the past experiences of these, but is otherwise independent of them. The XPath 2.0 and XSLT 2.0 standards pick up where the previous versions left off. They are separate standards from XQuery, but are designed to share the common core described above.

Tomorrow

Beyond XQuery 1.0, there are many other standards in the pipeline, such as XML 1.1 and XML Schema 1.1, which address issues users have had with the first versions of those standards. There is also the possibility of an XQuery 1.1 or XQuery 2.0 someday. XQuery 1.0 itself is tied to XML 1.0 and XML Schema 1.0, so future versions of those standards can only affect future versions of XQuery.

From an XQuery perspective, the main effect of XML 1.1 would be to allow many more characters in XML names and text, a change that could easily be accommodated by a future version of XQuery. The impact of XML Schema 1.1 on XQuery is mostly to address type system design issues that XQuery uncovered in XML Schema 1.0.

XQuery 1.1

There may never be another version of XQuery, but if there is, we can speculate as to some of the features it may add.

One feature that many people expected to appear in XQuery 1.0 is the ability to modify XML using the query language. Several proposals have been floated, but none made it into this first version of the language. I have more to say about this topic in Section 14.5.

Another feature that is near-and-dear to the hearts of many of the XQuery creators is support for full-text operations, including fuzzy-search capabilities. I say a little about this interesting topic in Section 14.6.

In XQuery 1.0, user-defined functions cannot overload one another or the built-in functions. It's possible that a future version might allow for more sophisticated function overloading, and of course extend the function library with even more built-in functions.

Another feature that received some discussion is the ability to create user-defined types directly in the query, without having to import an XML Schema. And finally, as users gain experience with XQuery 1.0, it's inevitable that certain features will be wished for, while some existing features turn out to be useless. Real-world experience will inevitably shape future versions of XQuery.

Data Manipulation

The ability to insert into, change, or delete parts of an existing XML instance using a query language goes by the name Data Manipulation Language (DML). The idea behind DML is to add keywords such as update, delete, and insert to the language, performing these operations instead of (or in addition to) constructing results.

DML is a tricky problem for several reasons, leading the committee to exclude it from the first version of XQuery. Some of these problems (such as view update and the “Halloween” problem) are already well known in relational database systems, but have not yet been completely solved for the XML domain. Other problems, such as node identity and consistency between parents and children, are peculiar to XML.

To give you an idea what form DML might take, the following three sections list proposals that have been made to add DML to XQuery.

XQuery DML

One possibility is to just take the query language as is, and add to it some keywords like update, delete, and insert. This is the most likely form that a future XQuery DML will take. These additional keywords would be usable at the top level, and probably also in place of the return clause in FLWOR. They might also be used as the branches of if/then/else statements, allowing for conditional data modification, and perhaps even in user-defined functions.

A few examples of what XQuery DML might look like are shown in Example 14.2.

Example 14.2. Potential XQuery DML instructions

insert <Employee/> into doc("team.xml")

delete doc("team.xml")//Employee[@years < 20]

update doc("team.xml")//Employee[@id="E0"]/Expertise
       with <Expertise>XQuery</Expertise>

for $i in doc("team.xml")//Employee
delete $i/Expertise

if ($final) then delete $tmp else insert <x/> into $tmp

There isn't any public documentation on XQuery DML at this time, nor any information as to whether it may include data definition features such as index management.
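Until something like this exists, the effect of a delete can only be approximated by constructing a modified copy, which leaves the stored document untouched. The following is a minimal sketch of that workaround and not any proposed DML syntax. It assumes the team.xml document used in the examples above and a processor that accepts the local: function prefix; local:copy-without is a hypothetical helper written for this illustration.

```xquery
(: Hypothetical helper: copies $node, skipping any node that is in $drop. :)
(: Attributes are copied unconditionally in this sketch.                  :)
declare function local:copy-without($node as node(), $drop as node()*) as node()?
{
  if (some $d in $drop satisfies $d is $node) then ()
  else
    typeswitch ($node)
      case document-node() return
        document { for $c in $node/node() return local:copy-without($c, $drop) }
      case element() return
        element { node-name($node) } {
          $node/@*,
          for $c in $node/node() return local:copy-without($c, $drop)
        }
      default return $node
};

(: Roughly the effect of: delete doc("team.xml")//Employee[@years < 20] :)
let $team := doc("team.xml")
return local:copy-without($team, $team//Employee[@years < 20])
```

The result is a new tree with new node identities, which hints at why node identity is one of the harder problems for a true in-place DML.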

SiXDML

SiXDML is a proposal by Dare Obasanjo for a simple XML data definition and manipulation language. It actually predates XQuery, and has since been reformulated to include XQuery syntax. It does far more than simple insert/update/delete, including index and collection management.

SiXDML uses collections to persist XML documents. Collections can be indexed, deleted, modified, and constrained (using schemas) through the SiXDML syntax. Several projects have already implemented SiXDML, including Xindice. Example 14.3 provides a few examples of SiXDML.

Example 14.3. Sample SiXDML instructions

CREATE COLLECTION employees

INSERT doc("team.xml") NAMED team.xml INTO COLLECTION employees

CREATE INDEX val-index OF TYPE VALUE INDEX
       WITH KEY="@id", ELEMENT="//Employee" ON COLLECTION employees

XUpdate

XUpdate is a data manipulation language with an XML syntax, vaguely like XSLT. XUpdate defines operators for inserting, updating, removing, and renaming XML nodes. Like SiXDML, XUpdate has already been implemented by several projects, including X-Hive.

Example 14.4. Sample XUpdate instructions

<xupdate:modifications version="1.0"
                 xmlns:xupdate="http://www.xmldb.org/xupdate">
  <xupdate:insert-after select="//Employee[1]">
    <xupdate:element name="Phone"/>
  </xupdate:insert-after>
</xupdate:modifications>

Full-Text Search

XQuery 1.0 is focused on exact queries over mostly structured data. In many XML applications, however, data is less well-structured and/or user queries are more vague.

Documents such as this book, although containing some structure (chapters, sections, paragraphs, and sentences), are mostly unstructured. Or consider searching the Web for information, when you're not exactly sure what it is you seek. For such cases, full-text search provides ways to perform “fuzzy” searches and work with data that has large, unstructured text components.

Full-text search thus involves two main components: a “word breaker” or tokenizer that imposes structure (words, punctuation, etc.) on otherwise unstructured text, and a “score” or ranking that, instead of matching exact results, produces a number representing how closely the results correspond to the query. Full-text also usually includes “fuzzy” string matching (such as patterns that cover regional spelling differences, synonyms, homonyms, or even phonetic expressions) and word proximity.

The combination of full-text and structured query is especially powerful. Imagine searching, for example, for all sections in this book that contain the word “path” near “expression” and that occur in odd-numbered chapters. Imagine searching the Internet for all Web pages that contain a word similar to “abra cadabra” and a table with exactly three columns. Imagine searching the Library of Congress for all bills on global warming authored by a particular senator.

It's not clear at this time how full-text might be incorporated into XQuery, although most likely it will be done using functions such as score(), contains(), or proximity(). See the Bibliography for additional references, including a very detailed full-text use cases document.
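In the meantime, the two components described earlier (a word breaker and a score) can be crudely approximated with existing built-in functions. The sketch below is only an illustration under stated assumptions: local:score is a hypothetical helper, book.xml and its para elements are invented for the example, and the scoring (the fraction of words that match the term exactly) is far simpler than what a real full-text engine would compute.

```xquery
(: Hypothetical helper: fraction of the words in $text that equal $term. :)
declare function local:score($text as xs:string, $term as xs:string) as xs:decimal
{
  (: The "word breaker": split on runs of non-word characters. :)
  let $words := tokenize(lower-case($text), "\W+")[. ne ""]
  return
    if (empty($words)) then 0.0
    else count($words[. eq lower-case($term)]) div count($words)
};

(: Rank the paragraphs of an assumed book.xml by how often they mention "path". :)
for $p in doc("book.xml")//para
let $s := local:score(string($p), "path")
where $s gt 0
order by $s descending
return $p
```

Nothing here handles stemming, synonyms, or proximity, which is exactly the gap dedicated full-text support is meant to fill.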

Performance Benchmarks

Although XQuery is a brand-new language, already many people have been hard at work creating performance benchmarks for it. These benchmarks tend to combine the kinds of processing found in existing XSLT benchmarks (like XSLTMark) with the kinds of processing found in existing SQL benchmarks (like TPC), producing a class of benchmarks that are uniquely XQuery.

The XMark suite was one of the first XQuery benchmarks; it can be found at http://www.xml-benchmark.org. XMark includes a program for generating large documents, and a collection of 20 queries that test various XQuery features from simple path navigation to complex grouping. Notably, this benchmark also tests type conversion overhead (a common source of performance problems in real-world XML applications).

Two other very interesting XQuery benchmarks are the Michigan Benchmark and XOO7, both of which are focused on XQuery in a database setting. The Michigan Benchmark is a kind of mini-TPC for XML. It tests 45 XML operations, mostly path queries but also a few joins and even 7 update cases (including bulk-loading XML into a database). The XOO7 benchmark ports the OO7 benchmark for OODBMSs to XML. It comprises 23 XQuery queries, mostly variations on single-level FLWOR expressions.

Conclusion

In this chapter, we considered a few topics that aren't found in XQuery 1.0 but may appear in future versions, including data definition and manipulation and full-text search. We also briefly touched on performance benchmarks for XQuery, and on changes that may occur between the time of this writing and the finalization of the XQuery 1.0 standard.

Further Reading

For more information on these topics, consult the references listed in the Bibliography. This book's Web site also lists errata and updates.

The (fascinating) collected works of Edsger W. Dijkstra can be found at http://www.cs.utexas.edu/users/EWD/, possibly the oldest “Web log” in existence.
