Chapter 13

What’s Missing?

13.1 Introduction

In Chapters 9, “XPath 1.0 and XPath 2.0,” 10, “Introduction to XQuery 1.0,” and 11, “XQuery 1.0 Definition,” you’ve read about the capabilities of XPath and XQuery. XQuery is a rich, expressive language for querying XML representations of data. You will also see in Chapter 15, “SQL/XML,” how SQL has been extended to use the expressive capabilities of XQuery in the context of a database, providing an ideal harness for XQuery in enterprise applications. While all of these are powerful languages for querying XML, they’re obviously not powerful enough to satisfy all needs.

Whether you are querying documents (in the sense of books, articles, papers, etc.) or more structured data with small snippets of text (the title and author of a book, or the name and description of a product in a purchase order), the matching expressions that you saw in Chapter 11 miss an entire class of searches – full-text searches. In Section 13.2, we explain what we mean by full-text searches and why (and how) they are different from queries with predicates over structured data. Then we discuss the W3C’s efforts to add some full-text capabilities to XPath and XQuery, and we compare the W3C’s current XQuery Full-Text drafts with some existing offerings in XML full-text search.

Another serious deficiency of XQuery 1.0 is the inability to alter the XML documents and other values that the language is designed to find. Updating data is a natural part of querying those data. Sure, pure search-and-retrieve languages are important tools, but real-world applications quite frequently require the ability to make changes to the XML that has been found and retrieved. While the Requirements for XQuery1 state that “Version 1.0 of the XML Query Language MUST not preclude the ability to add update capabilities in future versions,” there is no requirement that XQuery 1.0 provide those update capabilities.

As you’ll learn in this chapter, while vendors are filling the gaps with their own proprietary extensions to XQuery, both of these missing features are already under development in the W3C’s XML Query and XSL Working Groups.

13.2 Full-Text

13.2.1 What Is a Full-Text Query?

XQuery today lets you write queries that select data according to some criteria. For example, you can select movies where the running time is exactly 142 minutes, or where the year of release is between 1985 and 1990. But XQuery does not (yet) let you write Full-Text queries.

Most people have a general idea of what a Full-Text query does – it searches text-based information, given some words or phrases and some special operators. In this section, we give an informal description of what a Full-Text query is, and what makes it different from other (i.e., structured) queries. Note that in this section we are describing generic Full-Text concepts (and not specifically the W3C work on XQuery Full-Text), and we are using pseudo-syntax in our examples. We’ll look at the current W3C approach to XQuery Full-Text later in this chapter.

Words (Tokens)

A Full-Text query allows you to search for words inside text data. For example, with a Full-Text query you can search for the word “werewolf” in the title of any of our movies. The basic unit of Full-Text search is often referred to as a token rather than a word. Token is a more precise term – in Western languages, tokens generally map to words, though some search engines count some phrases, parts of words, and possibly punctuation, as tokens. In many non-Western languages (especially those where whitespace is not used to separate meaningful strings), there is no clear concept of a word, and a Full-Text engine has to make some tough decisions on where to draw token boundaries (or how to otherwise derive tokens from a piece of text). For the rest of this chapter we use the term word for convenience.

A common question from non-Full-Text users is, “If Full-Text search is about looking for words inside text, then XQuery already does that with the contains function. So what’s missing?” The contains function does not do a Full-Text search – it does a substring search. The main difference is that a Full-Text search will generally match only a complete word, and not just part of a string. For example, a Full-Text search for “dent” will not match a piece of text that contains the word “students,” but a substring search will.

Also, when running a Full-Text search, there is generally an assumption that the match will be case-insensitive,2 so that “dent” will match “DENT” as well as “dent” (and “Dent” and “dEnt” and “DEnt” and so on). With substring queries, matching is usually case-sensitive (depending on the collation used), so that the text being searched has to match the case of the search term.

Special Operations

A Full-Text query should be able to support some special operations that are not applicable to structured search (non-Full-Text search over dates, numbers, and short strings). These operations fall into four classes.

1. Word expansions. Often, when you query for a particular word (a query term), you actually want to find things that contain words that are related to the query term, as well as things that contain the query term exactly. For example, when you query for the word “mouse,” you might expect to find “mice.” This is the stem operation – find me all pieces of text which contain some word with the same linguistic root as the query term.3 You might also expect to find words that are close to “mouse” in a thesaurus – broader terms (mammal, animal), narrower terms (dormouse, field mouse), or related terms (rat, shrew). You might also want to correct for errors in the query term and/or in the text being searched. Spelling errors in the search term are common (how many CD buyers know how to spell “Kajagoogoo”?), but you may also want to forgive common typing errors (letters that are close to each other on a keyboard) or, in the case of OCRed4 text, common OCR errors (mistaking “i” for “1”).

2. Matching options. What factors are taken into account when deciding whether a query term matches a word in the text being searched? We have already looked at one kind of matching option, case. The answer to the question “does ‘DENT’ match ‘dent’?” depends on the case matching options for the query. Other common matching options are diacritics (consider or ignore diacritic marks) and wildcards (treat the query term as a string, possibly containing wildcards to be expanded, or as a literal string).

3. Positional (or “proximity”) operations. You may want to find things that contain both “Oracle” and “CEO,” but only when those words are discussed together. Of course, in order to know whether the words were actually related to each other in the text, the search engine would need to understand the meaning of the text – instead, we can approximate understanding by searching for occurrences of the two words near each other. A Full-Text search might allow you to specify “Oracle near CEO,” so something containing “the CEO of Oracle” will appear higher in the results list than “Oracle introduced a new product today. Their CEO ….” A Full-Text query might also allow you to specify the exact distance between the words, a distance range, and/or the order of the terms to match, as well as window notions such as “within the same sentence” and “within the same paragraph.”

4. Combining operations. Most structured query languages allow you to combine predicates with and, or, and not. A Full-Text query language should also provide those logical combinations, with a few extra wrinkles. For example, or might be complemented by an accumulate operation – while “dog or cat or mouse” returns anything that contains at least one of the terms dog, cat, or mouse, “accumulate (dog, cat, mouse)” ensures, in addition, that anything that contains all three terms ranks higher than anything that contains any two terms, which in turn ranks higher than anything that contains any one term, regardless of the number of occurrences of each individual term.

The not combiner is also a little different in Full-Text queries. First, the unary not (“not dog”) is famously difficult to execute efficiently with most Full-Text indexes. Many Full-Text languages disallow the unary not altogether – i.e., they only allow not as a combiner (“cat not dog”). Also, there is a strong case for a mild not in Full-Text query. The mild not says, “Don’t match this phrase, but don’t exclude text that contains it either.” For example, suppose you are researching government policy on housing, and you want to search for government bills that contain the word “house.” You don’t want the phrase “house of representatives” to trigger a match, because that’s not the sense of “house” you are looking for. At the same time, you don’t want to exclude everything that contains “house of representatives,” because then you would miss a lot of things that should match. So a search for “house” will bring back too many results, while a search for “house not (house of representatives)” will bring back too few. Only “house mild not (house of representatives)” will return everything that contains the word “house,” while ignoring any occurrence of “house” as a part of “house of representatives.”

Higher-Level Operations

If a Full-Text language implements a set of special operations such as those just discussed, then an expert searcher can get good, predictable results by combining words, phrases, and operators in intelligent ways. However, these operations are completely mechanical on the part of the search engine – they do not imply any real understanding of the content or of the user’s intent, and they put the burden of formulating just the right query on the user. This may work well if the user is an expert in both the subject domain and the capabilities of the Full-Text tool, but a Full-Text query engine should be able to do better than that. We discuss search more broadly in Chapter 18, “Finding Stuff” – for now, let’s briefly look at two ways to make Full-Text search smarter and easier.

Concept search allows you to search for a concept rather than a word or phrase. If you are a wine connoisseur, you might want to find everything that is about wine (the concept). You could start by searching for wine (the word), then apply stemming to find anything that contains the word “wine” or the word “wines,” then apply the thesaurus operation to find anything that contains narrow terms for wine (“merlot,” “chianti,” “zinfandel,” and so on). But it would be much better to be able to express directly in the Full-Text query language that you want to search for the concept of wine, e.g., by querying for “about(wine).” Some Full-Text query engines already offer a concept search – as computers get smarter and faster, we expect concept search to become more widespread and more accurate.

Suppose that your Full-Text query engine is not capable of concept search, or that you are particularly knowledgeable about your domain and the needs of your users. It is common for a search application to do progressive relaxation – that is, to progressively relax the query that the user typed in until some reasonable set of results (say, enough to fill a computer screen) is returned. In our example of a concept search for wine, we assumed that the wine researcher wanted to find everything that was about wine. In many situations, such as e-commerce, you want the first 20 or so results that most closely match the criteria the user typed in, and you want them very quickly – the complete set of possible results is not important. So a Full-Text query engine might provide support for query templates, which allow you to describe the relaxation steps. For an online bookstore search, the steps might be:

1. Search for all the words the user typed in, treated as a phrase, in the book title.

2. Search for each word the user typed in, in the author’s last name.

3. Search for each word the user typed in, with some spell-check expansion, in the author’s last name.

4. Search for each word the user typed in, anded together, in the title.

5. Search for each word the user typed in, ored together, in the title.

Note that both concept search and progressive relaxation could be considered as parts of an application rather than parts of a Full-Text query language. The language designer might say, “I’m giving you the basic building blocks to express any query – if you want concept search or progressive relaxation, go build it using these blocks.” Or she might say, “These higher-level operations are important and useful – I’ll put some constructs into the language so you can express them directly, without a lot of coding.” See also Section 13.2.5 for some discussion topics around what should go into a Full-Text query language.

Inexact Answers and Relevance (Score)

When you execute a regular structured query, you generally expect an exact answer. For example, if you search for all movies that were released in 1985, there is a single correct answer – there is no room for debate as to whether a particular movie was, or was not, released in 1985. When you run a Full-Text query, the result set can be subjective – the ordering of the results set always is.

Let’s take a simple example first – search for “Sheltie rescue.” We used Google (which is a search application) and found 118,000 pages on the Internet that contain the phrase “Sheltie rescue.” This is an exact answer – every page in the results set contains the phrase “Sheltie rescue” at least once, and every Full-Text engine that searches the same corpus (set of documents) should return the same results set. As we make the query more complex, as long as the operations are well-defined, the results set is predictable. However, some operations will bring in the engine’s “secret sauce” – e.g., thesaurus operations may use different thesauri – and this make the results set inexact.

The ordering of the results set, on the other hand, is always subjective. The Salton algorithm5 calculates a relevance score for a particular result by counting the number of times the query term occurs in the document. The algorithm takes account of how common the term is in the corpus overall, and some variations also take account of the length of the document. Most Full-Text engines use something based on this algorithm, plus (for web searches) some variation on the PageRank algorithm made famous by Google (see Chapter 18, “Finding Stuff”) to allocate a relevance score to each result. Results can be ordered according the relevance ranking (the size of the relevance score). But most, if not all, Full-Text engines then add some unpublished smarts – some “secret sauce” – to make their relevance ranking more effective.

In the early days of Full-Text development at Oracle, the Full-Text team devised some simple tests to measure the accuracy of the relevance scoring (and consequently the relevance ranking) of Full-Text query results. They took a fairly small corpus and had humans rank the results of some queries, and then they had their nascent Full-Text engine rank the results of those queries. They found that the Full-Text engine agreed with a human’s ranking only about 60% of the time. But they also found that humans agreed with other humans only about 40% of the time! Even when there is little ambiguity in the query term, such as a query for “dog,” people disagree wildly about how “doggy” a particular item is, and whether one item is more or less “doggy” than another.

In summary, relevance scoring by humans is highly subjective, and relevance scoring by computers is generally proprietary, used by Full-Text vendors to differentiate their engines. This makes the semantics of relevance ranking impossible to standardize.

Performance

A Full-Text query is also different from a substring query in the area of performance (or at least expectations of performance). When you run a Full-Text query, you expect it to run much faster than a full scan through all the text being searched. This superior performance comes, of course, from the existence of a Full-Text index. The most common form of Full-Text index is an inverted list. This consists of a list of all the words that occur anywhere in the corpus (the set of documents that is the universe of search). Associated with each word is a list of the items in which that word occurs. When you search for “dog,” the Full-Text engine looks up the word “dog” in the list and finds all the items where that word occurs.

There are some common variations on the inverted list structure. Many Full-Text indexes index n-grams rather than words. Here’s how n-grams work: If the item to be indexed is “Mary had a little lamb,” a word index would split this into “Mary,” “had,” “a,” “little,” and “lamb,” and track all items that contain each of those words. An n-gram index might split the same phrase into “M,” “Ma,” Mar,” “ary,” “ry,” “y,” “h,” “ha,” “had,” “ad,” “a,” etc. This example is based on a 3-gram or tri-gram index, though the “n” may have other values. An n-gram index is particularly useful when many queries have wildcards at the beginning and/or end of a word, for languages such as Chinese and Japanese where the exact “word” structure is difficult to determine, and for text that is inaccurate (because it was typed or OCRed poorly).

The structure of an inverted index dictates some characteristics of Full-Text engines:

• The inverted list structure makes it very expensive to add a new item. Whenever a new item is added to the corpus, the inverted list entry for each word in the item must be updated (extended). Most Full-Text indexes get around this by adding to the index asynchronously, so that they can add many new items at once, and by adding the index information for new items at the end of the list rather than in-place. This in turn may lead to fragmentation, which leads to the need for periodic index reorganization.

• The inverted list structure makes it even more expensive to delete an item. When you delete a single 10,000-word item, you must update 10,000 list entries. Most Full-Text indexes get around this by performing lazy deletes – marking items as deleted, but not actually deleting the index entries. Each query must then check whether a potential result has been deleted before returning it. This also leads to the need to optimize the index periodically (to perform the actual deletes), and it may lead to inaccurate result counts (that’s part of the reason for the “1-10 of about …” that you see on search pages).

• The inverted list may be optimized for a particular set of match options. For example, if most searches are case-insensitive, it doesn’t make sense to create a case-sensitive index, and then do a case-insensitive match of the index items at query time. Instead, the index is often built as case-insensitive – all the words are converted to the same case while they are indexed. This makes queries faster for the most common (case-insensitive) searches, but it makes case-sensitive searches impossible (or very slow – it’s possible to retrieve all the results of a case-insensitive search, and then to scan each one to see of it matches the case-sensitive search).

In summary, a Full-Text index increases the performance of Full-Text queries. It is generally built as an inverted list. This means that changes to the data are not (generally) reflected immediately in the index; that the index must be periodically optimized (manually or under the covers); and that some query options must be chosen at index build time.

13.2.2 Full-Text and XML

In Section 13.2.1, we were careful to talk about searching a collection of “things” or “items,” and returning “things” or “items” as results. In classical Full-Text search we talk about searching for documents, and returning documents as results – i.e., the document is the basic unit of search. As you read in Chapter 3, “Querying XML,” XML changes all that – in XML, we still have the notion of a document, but we generally search in parts of the document and return parts of the document (not necessarily the same parts).

For Full-Text search, this represents both a challenge and an opportunity. The challenge is to provide a language in which you can express which parts of a document you want to search, which parts you want to take into account when calculating relevance score, and which parts you want to return. The opportunity is that Full-Text search can be much faster and more accurate when searching/returning only parts of a document.

For example, suppose you have a set of journal articles that consist of title, author, date, and a set of headings, subheadings, paragraphs, and footnotes. A Full-Text search application that searches over XML, where each of those items is a separate element, can easily search across titles first, then headings, then subheadings, and then paragraphs and never search across footnotes. Assuming each kind of element is separately indexed, most searches will be very fast, since some results will be found in a very small title index. Results will be highly relevant, since a document with “dog” in the title is much more likely to be relevant to a search for “dog” than one with “dog” just anywhere in the document, including footnotes. And the results will be much more useful than the typical Full-Text result, which is (a pointer to) a whole document – with XML, you can return the actual paragraph that contains the words you are looking for, or the title of the journal + the interesting paragraph + the paragraph on either side.

In short, XML and Full-Text were made for each other. Given XML’s beginnings (in SGML) as a way of adding structure to unstructured text, we find it incredible that XQuery does not yet have (indeed, did not start with) a Full-Text capability.

13.2.3 Defining XQuery Full-Text

The W3C XQuery and XPath Working Groups have set up a Task Force – a subgroup, if you will, of the Working Groups – to come up with a proposal for an extension to XQuery, to be called XQuery Full-Text.6 This Task Force has published three documents – the XQuery and XPath Full-Text Requirements,7 first published in May 2003; XQuery 1.0 and XPath 2.0 Full-Text Use Cases;8 and XQuery 1.0 and XPath 2.0 Full-Text9 (language and semantics). The first of these documents (requirements) is now quite stable. The last two (use cases and language and semantics) are published at regular intervals as the work of the Task Force progresses – at the time of writing, the latest publication is dated November 2005.

The requirements document says that XQuery Full-Text must be properly integrated with the XQuery/XPath language, following the same universality rules, and must be composable with XQuery. It also lists the minimum set of Full-Text functionality that must be in the first release of XQuery Full-Text. The list is:

1. single-word search

2. phrase search

3. support for stop words

4. single character suffix

5. 0 or more character suffix

6. 0 or more character prefix

7. 0 or more character infix

8. proximity searching (unit: words)

9. specification of order in proximity searching

10. combination using AND

11. combination using OR

12. combination using NOT

13. word normalization, diacritics

14. ranking, relevance

This Requirements spec balances concerns over XQuery Full-Text being created as a hastily designed bolt-on to XQuery that would have to be fully integrated in following versions against concerns that the first XQuery Full-Text version might be either too simplistic to be useful or too full-featured to be released in a reasonable time frame.

The use cases are very complete – they provide examples of every corner of the XQuery Full-Text language. They serve not only as a motivation for each of the features, but also as a tutorial for the new user, and even as a basic test bed for implementations.

Before describing the current state of the XQuery Full-Text language spec, let’s look at some approaches that were not adopted by the Task Force – objects, functions, and many-functions.

Approaches – Objects

One obvious way to implement XQuery Full-Text would be to follow SQL’s lead. After all, ANSI and ISO had been down this path already – they defined SQL/MM (SQL Multimedia and Application Packages) Part 210 to extend the SQL language to incorporate Full-Text search (Part 1 defines the framework, and other parts define SQL support for Spatial and Image data). SQL/MM takes an objects-based approach. It defines a new UDT (user-defined type) called FullText, to represent any data that is Full-Text-searchable. It then defines methods on the FullText object, including a CONTAINS method to test whether a document matches a text query, and a RANK method to return the relevance score of a text query. Example 13-111 shows a table created with a column of type FullText and a query against that table. The query returns the docno for each row in the table where the document contains “standard” or “standards” in the same paragraph as a word that sounds like “sequel” (e.g., “SQL”). The results are ordered by the relevance score of the same text query.

Example 13-1   SQL/MM Full-Text

image

Approaches – Functions

The idea of reusing the definitions generated by another standards body might seem appealing. But XQuery is an expression-based language with functions – it does not have objects. The obvious way to graft the SQL/MM approach onto XQuery would be to define two functions, say, mmcontains and mmscore (the function name contains is already taken by a substring function). An XQuery based on these functions might look like Example 13-2.

Example 13-2   XQuery Full-Text, Functions

image

The advantages of this approach are:

• All the work is already done – the SQL/MM Full-Text definitions could be grafted onto XQuery with very little effort.

• Some database vendors have already implemented some form of SQL/MM Full-Text. It would be relatively easy for them to implement XQuery Full-Text using existing technology.

So the standard could be defined quickly, and at least some vendors could implement it quickly and easily. The disadvantages of this approach, though, are:

• The queries are verbose.

• More importantly, this approach does not address the requirement of composability.

That is, the string that makes up the second parameter is not a part of the outer language – it’s a string with its own “sublanguage.” Suppose you wanted to use such a query in your application but that you wanted the words “standard” and “sequel” to be replaced with variables (instead of being literals) – perhaps a user types them into some web page, perhaps they are derived somehow. You can’t express that easily in the XQuery – i.e., the string cannot contain variables (or expressions) where the search terms in the example are hard-coded. There are ways around this – you could allow some kind of string substitution in the parameter string, just as the Perl language does. Or you could just build up the string in a separate step, before calling the function.

Note that some of the verbosity comes from having to type in the sublanguage string twice, once for mmcontains and once for mmscore. If you want to be able to score (and therefore rank) results only on the same criteria you use to select items, you can avoid this. The filter function (mmcontains) could have a side effect – e.g., calling mmcontains might set a local variable to a score value. Several existing Full-Text implementations use a side effect to filter and produce a score in one step.

Approaches – Many-Functions

In the previous sections, we explored extending XQuery and XPath to handle Full-Text search by following the ANSI/ISO approach of introducing a special object and some methods, and then we looked at the possibility of introducing those methods into XQuery and XPath as two functions, which we (arbitrarily) called mmcontains and mmscore. A major objection to this approach is that it involves a sublanguage – the string containing the text query expression is, from XQuery’s point of view, just any old string (and not an expression). That means it’s clumsy to construct, and users must learn and use new operators that have similar, but not identical, semantics to operators they already use in XQuery (and, or, not, etc.). While there are clearly workarounds to these issues, many feel that this approach makes Full-Text a second-class citizen in the world of XQuery, that Full-Text is not quite (and, more importantly, never can be) a fully integrated part of XQuery under this scheme.

An alternative approach is to make every Full-Text operation a true, first-class XQuery function, doing away with the sublanguage string altogether. So, instead of the XPath expression

image

you would write

image

We have (arbitrarily) used a prefix “mf” (for “many-functions”). This could be part of a function-naming convention, or it could (with the right syntax) be a namespace prefix that identifies the mf set of functions, or it could be dropped altogether as long as the function names didn’t clash with existing XQuery function names. In this example, mfand-contains is a first-class XQuery function, and its arguments – “dog”, “cat” – are just regular XQuery strings. The drawback of this approach is obvious – instead of two functions with many operators (inside a sublanguage), this approach yields many functions. The maximum number of functions needed is twice the number of operators in the sublanguage, i.e., one function for each operator for contains, plus one function for each operator for ranking. However, on closer inspection it becomes clear that only combining operations and proximity operations (as described earlier in this section) need a function each, while matching options and word expansions need far fewer functions. Let’s rewrite Example 13-2 to see how this might work in practice.

Example 13-3   XQuery Full-Text, Many-Functions

image

In Example 13-3, we needed to introduce two functions, mf-sameParagraph-contains and mf-sameParagraph-rank, to express “match these words in the same paragraph,” but we only needed to introduce one function, mf-expand-words, to express stemming and soundex.12

Approaches – Summary

In this section, we described three alternative approaches to extending XQuery and XPath to do Full-Text search – objects, functions, and many-functions. The ability to do Full-Text queries is extremely important to XQuery – some might even say that XQuery without Full-Text is not a viable language. Some of the database and query vendors clearly appreciate this importance, and they have not waited for the W3C spec to become a Recommendation before implementing XQuery Full-Text in some form. As you will read in Section 13.2.6, the functions and many-functions approaches have already been implemented by at least one vendor.

In the next section, we look at the approach currently being pursued by the Full-Text Task Force of the W3C XQuery Working Group. We expect this will be part of some (near-)future XQuery spec.

13.2.4 W3C XQuery Full-Text – Grammar Extension

In this section, we describe the current W3C Working Draft specification of XQuery 1.0 and XPath 2.0 Full-Text. Now that we have a clear idea of what a Full-Text query is and have seen some approaches to XQuery Full-Text that were not adopted by the W3C XQuery Full-Text Task Force, let’s look at the approach that is, at the time of writing, expected to yield the W3C XQuery Full-Text language.

The approach the W3C Task Force is pursuing is an extension to the grammar of XQuery and XPath. This is the most ambitious approach – rather than using the existing extensibility mechanisms available in XQuery (i.e., functions), W3C XQuery Full-Text extends the XQuery grammar with additional grammar rules and keywords. The advantage of this approach is that W3C XQuery Full-Text is an integral part of the XQuery language – and it’s first-class in every way, fully composable, and it introduces some notions (such as ranking) that might carry over into non-Full-Text XQuery. The downside is that the XQuery Full-Text language syntax and semantics is brand new. That means it will take (has already taken) a long time to define, and we expect it will take vendors a long time to implement (longer, anyway, than an approach based on existing standards). That said, we are excited about the prospect of a standard, rich language for doing Full-Text queries over XML.

The XQuery Full-Text Requirements laid out a minimum set of operations that had to be defined in XQuery Full-Text for it to be generally useful. These operations (and more) are described as extensions to the XQuery and XPath grammar, so it’s appropriate at this point to look at the XQuery Full-Text EBNF rules. If you followed along with the discussion of the XQuery grammar in Chapter 11, you are already familiar with the style of the XQuery grammar rules. XQuery Full-Text “breaks in” to the XQuery grammar at rule 50 (48 in XQuery),13 where FTContainsExpr is defined as a variation on the comparison expression ComparisonExpr. The XQuery grammar says:

image

while the XQuery Full-Text grammar says:

image

The XQuery grammar rules employ a “cascading precedence” style, so the precedence of an operator is clearly fixed by its placement in the EBNE. A RangeExpr – on the left-hand side of the ftcontains keyword – can be an instance of an AdditiveExpr, which can be an instance of a MultiplicativeExpr, and so on all the way down the precedence tree, through PathExpr to PrimaryExpr and ParenthesizedExpr. That means that almost any expression, including a parenthesized expression, may appear on the left-hand side of ftcontains. If you have doubts about the precedence of any of the operators in the expression to the left of ftcontains, just put parentheses around that expression.

Above ftcontains in the precedence tree are or and and, so ftcontains binds more tightly than or and and.

image

This places the XQuery Full-Text extensions within the XQuery grammar. From the productions we have looked at so far, we can expect to see XQuery Full-Text with a where clause something like this: … where $i/title ftcontains …. On the right-hand side of the ftcontains keyword, we must see an FTSelection optionally followed by an FTIgnoreOption.

Let’s look at the FTIgnoreOption first, because it’s simpler. Note that you can only have one FTIgnoreOption per ftcontains, and if it appears at all it must appear at the very end of the expression. This is for simplicity – allowing FTIgnoreOption in other places in the expression would make queries very difficult to parse (for humans as well as computers).

image

FTIgnoreOption lets you say that you want to ignore some parts of the node sequence that you are searching. For example, you might want to search chapters of a book for the word “dog” but ignore footnotes. “… without content //footnote …” might achieve that objective. You could achieve the same thing with some clever XQuery coding, of course, but this kind of query is important enough in Full-Text search to have a special, more concise syntax.

The meat of the FTContainsExpr expression is in the rest of the right-hand side, the FTSelection grammar rule.

image

The FTSelection describes what some Full-Text languages call the text query expression – just to remind you of the context, the FTSelection is likely to appear in the where clause of a FLWOR (or in the predicate of an XPath), e.g., “… where <some node sequence> ftcontains <some FTSelection> ….” The right-hand side of FTContainsExpr consists of the words and phrases that you want to search for, the ways those words and phrases are combined (and, or, any, all), the parameters that affect what constitutes a match (e.g., the case-sensitivity option determines whether “dog” matches “Dog”), and some positional operators (within three to five words).

The FTSelection also allows a weight – the weight has no effect when doing a Boolean search (using the ftcontains keyword); but when this same expression is used in the description of score, a higher weight lets you say that some word is more important than some other word. So, if you search for “(dog weight 1.0) | | (cat weight 0.1)”, you can expect a doggy document to appear higher in the results list than an equally catty document.

Let’s drill down to the details of the FTSelection by walking down the tree of each of its components, starting with FTOr.

image

This is another example of cascading precedence, typical of the style in which the XQuery grammar rules are written. An FTOr is an FTAnd, optionally followed by “| |” (the Full-Text symbol that can be pronounced “or”) and another FTAnd, any number of times. Similarly, an FTAnd is an FTMildnot, optionally followed by “&&” (the Full-Text symbol that can be pronounced “and”) and an FTMildNot, any number of times. These two rules could, of course, be combined into a single rule, but this style of grammar makes it crystal clear that “&&” binds more tightly than “| |.” What does that mean? It means that the snippet “cat && dog | | mouse” is an FTOr, where the FTAnds are “cat && dog” and “mouse.” That means that a node that contains both cat and dog will match this snippet, and so will a document that contains only mouse. Another way of saying that “&&” binds more tightly than “| |” is that “&&” has higher precedence than “| |.”

It may surprise some readers that the not operator is unary in XQuery Full-Text – that is, it’s possible to say “! dog”; you don’t have to have a word or phrase on the left-hand side of the not operator. The unary not is famously difficult to implement with an index. An index typically stores information about which words a document or node contains, and it lacks any information about which words a document or node does not contain. So if you want to find out which items do not contain the word “dog,” it’s very difficult to find that out from an index – you have to assume some universe of discourse (every possible item) and subtract from that every item in the index that does contain the word dog. That said, some Full-Text users feel that this is important functionality. Of course, you can express a binary not by combining the unary not and the and, e.g., “dog && !cat” (dog and not cat).

W3C XQuery Full-Text also supports a mild not operation (see Section 13.2.1, under the subheading “Special Operations”). A mild not lets you specify that the occurrence of a phrase should not eliminate an item from the results set, so you can say “mexico mild not new mexico” to find items that contain “mexico.” Items that contain “new mexico” will not be excluded from the results, but neither will they be included, unless they also contain “mexico” on its own.

The last grammar rule in this snippet – FTWords – tells us what constitutes a word (or phrase). From this rule, the following can represent what you are searching for:

1. A string literal – “dog”

2. A string literal with more than one word – “Mother Mary comes to me”

3. A sequence of strings – (“Mother Mary comes to me”, “Speaking words of wisdom”, “Let it be”)

4. A variable – $myVariable

5. The context item –. (“dot”)

6. A built-in function – string-value (. /title [1])

7. A user-defined function – myFunctions:getUserInput()

8. An expression enclosed in {} –

    {for $i in collection (myRules/rule) return rule/searchTerms}

This list is not meant to be read as a definition, rather as a set of examples – (1) is a special case of (2), which is a special case of (3), since (2) is a sequence with only one member. The W3C XQuery Full-Text spec says that the FTWords “must evaluate to a sequence of string values or nodes of type “xs:string”. The result … is then atomized into a sequence of strings which then is being [sic] tokenized into a sequence of phrases…. If the atomized sequence is not a subtype of xs:string*, a type error … is raised.” So wherever you see FTWords in the grammar, you are guaranteed to get something that evaluates to a sequence of strings (which may have only one member). How does this sequence of strings get interpreted – for example, will the query try to match all the strings in the sequence, or any string in the sequence? This is dictated by the FTAnyallOption.

image

There are five possible values for FTAnyAllOption.

• “any” – each member of the sequence is interpreted as a phrase, and the FTContains expression evaluates to true if there is a match for any (that is, at least one) of those phrases in the text being searched.

• “all” – each member of the sequence is interpreted (again) as a phrase, and the FTContains expression evaluates to true if there is a match for all (that is, every one) of those phrases in the text being searched.

• “phrase” – the sequence is flattened into a single string (the strings are concatenated, with whitespace in between). The FTContains expression evaluates to true if there is a match for that new string, treated as a phrase, in the text being searched.

• “any word” – the sequence is flattened into a single string, and FTContains evaluates to true if any word in that new string matches a word in the text being searched.

• “all words” – the sequence is flattened into a single string, and FTContains evaluates to true if each word in that new string matches at least one word in the text being searched.

This deserves some examples! Example 13-4 shows the effect of each of these options on a string and on a sequence of strings. In the examples, the “Query snippet” is part of a W3C XQuery Full-Text query, for example:

image

The “Interpretation” is alternative W3C XQuery Full-Text syntax giving the same semantics.

Example 13-4   FTAnyAllOption

image

image

If no option is specified, the default behavior is the same as “any.”

By now, you should have some sense of how the Full-Text operations fit into XQuery. Let’s look briefly at the rest of the XQuery Full-Text grammar before describing some examples.

image

FTProximity lets you specify how close two words or phrases must be, as a number or range of words, sentences, or paragraphs.14 You can also specify whether or not the order of the words in FTWords must match the order in which they appear in the text. The FTProximity production also includes a way to specify how many times a word must occur (though FTTimes doesn’t fit with the name of the production – it’s not a proximity operation). FTProximity expands to one of the following:

• FTOrderedlndicator – “… ‘dog’ && ‘cat’ ordered …” matches text that contains both “dog” and “cat,” where “dog” comes before “cat.” If this option is “unordered” (the default), the order of the words in the text does not affect the match.

• FTWindow – “… ‘dog’ && ‘cat’ window exactly 5 words …” matches text that contains both “dog” and “cat” within a window of 5 words. The unit can be “words,” “sentences,” or “paragraphs.” All three units are implementation-dependent (they are part of the tokenization process). The window option is a shorthand for a common form of distance (see next item) – i.e., any expression using window could be rewritten using distance.

• FTDistance – “… ‘dog’ && ‘cat’ distance exactly 5 words …” matches text that contains both “dog” and “cat” exactly 5 words apart (i.e., with exactly 5 intervening words). The unit of distance can be “words,” “sentences,” or “paragraphs.” The distance can be specified “exactly,” as a minimum distance (“at least 3 words”), a maximum distance (“at most 7 words”), or as a range (“from 3 to 7 words”).

• FTTimes – “… ‘dog’ occurs at least 3 times …” matches text that contains the word “dog” at least 3 times. You can also specify an exact number of occurrences (“occurs exactly 5 times”), a maximum number of occurrences (“occurs at most 7 times”), or a range (“occurs from 3 to 7 times”).

• FTScope – “… ‘dog’ && ‘cat’ same sentence …” matches text that contains both “dog” and “cat” in the same sentence. The possible units for FTScope are “sentence” and “paragraph.”

• FTContent – “… ‘dog’ at start …” matches text that contains “dog” at the beginning (i.e., “dog” is the first word in the text). You can also specify that the word must be the last word in the text (“at end”), or that the word (or phrase) must match the entire content of the text being searched (“entire content”).

image

FTMatchOption dictates how words are to be matched – e.g., you might want to match words ignoring case or diacritics. This rule mixes what we previously described as matching options and word expansions – you could think of “expand this query term into all the words with the same linguistic root” as “match this query term, but consider stemming as a matching option.” FTMatchOption expands to one of the following:

• FTCaseOption – “… ‘Dog’ case sensitive …” matches text that contains the word “Dog” with an uppercase “D” and lowercase “og.” Other options are “lowercase” or “uppercase” (convert the query term to lowercase/uppercase before attempting to match it) and “case insensitive” (do not take case into account when attempting to match the query term).

• FTDiacriticsOption – “… ‘résumé’ diacritics insensitive …” matches text that contains “résumé without taking diacritics into account (e.g., “resume” will match). As well as the diacritics options “sensitive” and “insensitive,” you can specify “with diacritics” or “without diacritics.”15

• FTStemOption – “… ‘dog’ with stemming …” matches text that contains any word with the same linguistic root as “dog” (“dog” or “dogs”).

• FTThesaurusOption – “… ‘dog’ with thesaurus at ‘http://example.com/myThesaurus.xml’ relationship ‘BROADER TERM’ at most 2 levels …” matches text that contains “dog” or “canine” or “mammal.” Possible relationships include at least those defined in the ISO 2788 Thesaurus standard,16 but there may be additional (implementation-defined) relationships. For hierarchical relationships, you can also specify an exact/minimum/maximum/range of levels to be considered.

• FTStopwordOption – “… $q with stop words …” matches text that contains any word in $q, ignoring any stop words (words that have been defined by your implementation to be “noise words”).

• FTLanguageOption – “… ‘dog’ with language ‘Russian’ …” tells the query processor that the query language is “Russian.” This may affect the way case, diacritics, thesaurus, and stemming options are processed.

• FTWildCardOption – “… d.*g with wildcards …” matches text that contains any word that starts with a “d,” ends with a “g,” and has zero or more characters in between. If the option is “without wildcards” (the default) then “d.*g” is interpreted literally. Other possible wildcard designators are “.?” (zero or 1 characters), “.+” (one or more characters), or, e.g., “{3, 7}” (from 3 to 7 characters).

Note that most of the match options can be turned on or off (“with wildcards”/ “without wildcards,” “with stop words” / “without stop words”). Remember that queries can be built up using parentheses, from the grammar production for FTWordsSelection. See, for example, the use of parentheses in Example 13-6.

image

The remaining XQuery Full-Text grammar productions are reproduced here for completeness. At least some of these productions will almost certainly have changed in the spec by the time you read this, so we want to give you a complete snapshot of the grammar we are discussing in this chapter.

image

image

We urge you to browse the (most excellent) XQuery Full-Text Use Cases spec for examples of all the XQuery Full-Text operators. We have created a couple of examples later, with explanations, to illustrate just how expressive XQuery Full-Text is.

Score

We will not give a full description of score here, as the definition of score is likely to change significantly. The current draft spec shows score as a (second-order) function, taking as its argument a Boolean combination of FTContains expressions. Another possible approach is to make score a clause in the FLWOR expression (a bit like a let clause), binding a variable to the (second-order) evaluation of some FTContains expressions combined with the XQuery and or or. The relative importance of terms in the query is set by the weight operator in the right-hand side of the FTContains expression(s). The grammar for a score clause might look like this:

image

Some XQuery Full-Text Examples

Example 13-5 is a very simple XQuery Full-Text query that returns the year of every movie with “werewolf” in the title. Note that the query does not include any match options, so, e.g., case sensitivity, stemming, and stop words are all defaulted.

Example 13-5   Simple XQuery Full-Text

image

Example 13-6 is a less simple example. In this example we assume an additional element in the movies data sample, a description, and we assume that a variable $myInput exists, representing some input typed by a user.

Example 13-6   Less Simple XQuery Full-Text

image

image

The query in Example 13-6 returns the titles of all movies that are about American werewolves, according to some additional criteria, ordered by their American werewolf-ness. Let’s start with the where clause – there are three criteria here, and if a movie meets any of those criteria it will get into the results set.

• The title contains “American,” with an uppercase “A” and lowercase “merican,” and it contains any word with the same stem as “werewolf” (“werewolf,” “werewolves”). The “werewolf” words may be in any mix of cases. The word “American” and the “werewolf” word must occur within a distance of two words or less, and “American” must occur first.

• $x is a variable provided by the context of the query – perhaps it’s a list of words that we want to match for every query, perhaps it’s a list of words that get computed somewhere. Any movie where the title contains all the words in $x will get into the results set.

• The third criterion is a FLWOR expression, enclosed in “{ }”. Let’s say this expression somehow calculates words that have been typed in by a user. Since the words have been typed in (i.e., we don’t know when we are writing the query what those words will be), we want to ignore some “noise words.” We don’t want to specify a list of (possibly hundreds of) stop words in the query, so we reference some known stop word list via a URL (the language says this only has to be a string literal, but a URL makes sense here). Then again, we don’t want to ignore all the stop words on that reference list – if the user types in “in” or “about” or “around,” then we do not want to ignore any of those.

So much for defining what goes into the results set. Now, in what order should we deliver the results? We use the score clause to calculate a relevance score, and then we order the results using the order by clause. Note that we could just repeat the whole query as part of the score clause, but we have decided we want to rank the results based on different criteria than the ones used to decide which movies appear in the results set. We want movies with “werewolf” in the title to appear very high in the results, and we want movies with “american” in the title to appear somewhat high in the results. We also want movies with either of those words in the description to get a slight boost. That’s all we can say about this query – we cannot predict the score for any particular movie, for the scoring algorithm and the effect of weights are implementation-defined. Note that the two FTContains expressions in the score clause are ored together, using the XQuery or and not the Full-Text | |. That’s because the score clause binds a variable to some set of (second-order) FTContains expressions, combined with the XQuery Boolean and or or.

Our last example is an XPath example. The Full-Text extensions are designed to be useful as part of XPath 2.0 as well as XQuery 1.0. Example 13-7 produces a sequence of titles of movies whose description contains either “American” at the start of the description or “werewolf” at least 3 times in the description.

Example 13-7   XPath Full-Text

image

If you want to try out some examples for yourself, we recommend going to the W3C’s XQuery home page and following the link to the XQuery Full-Text Test page.17 Here, you can type in an XQuery Full-Text expression and see its parse tree (though not its results). This is an excellent way to check the syntax of any XQuery and to get a feel for how XQueries are parsed.

Semantics

We cannot leave this section without mentioning the semantics part of the XQuery Full-Text Language and Semantics spec. A great deal of care and effort have gone into formally defining the semantics of those parts of the XQuery Full-Text language that can be formally defined – i.e., those parts that are not implementation-defined. This formal definition is an important part of the spec, but we prefer to use these formal semantics as a safety net, to help us figure out what behavior is expected when the descriptive sections are incomplete or unclear.18 We urge the interested reader to browse the semantics section of the spec, especially if she intends to build an XQuery Full-Text engine. (The XQuery Full-Text semantics section is rather dense. As an introduction to the concepts used in describing the XQuery Full-Text semantics, you might want first to read the TexQuery paper19 on which it’s based.)

13.2.5 W3C XQuery Full-Text – Some Discussion Topics

At the time of writing, the W3C XQuery Full-Text Working Draft is still fairly fresh – for example, score has only recently been converted from a function to a grammar extension. Since the spec is likely to change over the next few drafts, we don’t intend to describe the language in any more detail here – the interested reader now has a good idea of where to look, and what to look for, and has some context in which to read the latest specs. Instead, we introduce some discussion topics around XQuery Full-Text. This gives some insight into what kinds of issues may be on the table when you read this book, and what parts of the spec may still change. Then we describe some of the XQuery Full-Text implementations that are already commercially available, way ahead of any W3C spec that might be called a “standard.”

Definition of score

XQuery Full-Text has two major parts. FTContains is a Boolean expression with a precise definition.20 score is a way of measuring the relevance of each result to the query – for each match, just how good a match is it? The XQuery Full-Text Language spec gives a lot of detailed semantic definition to say exactly what should be the result of the evaluation of any particular FTContains expression. But what can we say about score?

Certainly we must define the data type that results from the evaluation of score – XQuery is a strongly typed language, and we need to know whether score results in an integer or a float. It’s fairly clear that we should also define the range of possible values for score; otherwise, scores from different implementations cannot be compared. But what does the value of the score actually mean? At the time of writing, the score clause binds a value to a variable, and that value must be of data type xs:float, in the range [0, 1]. So far, so good, but with this definition I could build an XQuery Full-Text engine that always returns 0 (or 1, or 0.5, or …) for score and claim compliance with the spec.

The spec also says that for “score values greater than 0, a higher score must imply a higher degree of relevance.” Yet relevance is implementation-defined, and it is in any case highly subjective. We believe this is a good rule to have, but it can only be followed in spirit (and not to the letter).

Then there is the question of the relationship between the value of score and the Boolean value of FTContains for the same query over the same data. Should a score of 0 imply that FTContains is false? Conversely, should an FTContains value of false imply that the score is 0? This is currently noted as an issue – we believe that, for now, the value of score and ftcontains should not be bound together.

Finally, there is the notion of score using second-order functions. Informally, a second-order function is one that has to be processed as a string, not one that can be evaluated immediately. In the current XQuery Full-Text spec, the score clause binds a variable to an expression. But this is no ordinary expression – a query processor cannot evaluate that expression on its own and then substitute the result in the query. The expression is always an FTContains (or a series of FTContains expressions combined using and or or), which would always evaluate to a Boolean. Rather, the processor has to consume this special expression as if it were a string – “what would the score be for a query that consisted of just these FTContains expressions?” XQuery has so far avoided second-order functions, so the current score clause is something “special.”

How Standard?

The questions around score (earlier) bring up the question of just how “standard” XQuery Full-Text can possibly be. The basic algorithms for calculating score, and hence ranking results by relevance, are well known – we have described Salton’s algorithm and PageRank elsewhere in this book (see Chapter 18, “Finding Stuff”). But search engine vendors generally add their own “secret sauce” to these algorithms, each claiming that its engine gives better ranking than any other. That means that we cannot standardize the semantics for score21 – it has to be implementation-dependent.

Similarly, tokenization – the exact method for breaking Full-Text down into tokens (usually words) – is implementation-dependent. So an XQuery Full-Text engine can choose to tokenize text, for example, so that the tokenized output contains the stemmed values for each word in the text (all words with the same linguistic root as each word in the text). Such a tokenizer would result in every XQuery Full-Text search being a stemmed search, even though the spec says that stemming is optional and that the default is not stemmed.

Add to this list thesaural search (the contents of thesauri are not standardized) and stemming (the semantics of stemming is not defined), and it’s clear that two compliant W3C XQuery Full-Text engines could give a wide range of results for the same query.

We believe that it is worthwhile to standardize the syntax of Full-Text search. While some would argue that we should stop there, we also believe that it is worthwhile to standardize the semantics of Full-Text search as much as possible, though we must also recognize the limitations of this approach.

Delineating Embedded XQuery Expressions

One of the goals of the W3C XQuery Full-Text language is that Full-Text operations should be a first-class part of the XQuery language. That means we need to be able to mix Full-Text and other kinds of expressions freely in a query. It’s difficult to do that unless you know where the Full-Text productions apply and where the XQuery productions apply. The easiest way out of this dilemma is either to introduce a new kind of parentheses that enclose a Full-Text expression (actually an FTSelection, the right-hand side of a Full-Text expression), or to introduce a new kind of parentheses that enclose an XQuery expression when it occurs inside an FTSelection. At the time of writing, XQuery Full-Text uses “{}” to enclose an XQuery expression inside an FTSelection, but this may change.

Defining Query Options Indirectly

It should be clear to the reader by now that a Full-Text query has a lot of options – the language, thesaurus (or thesauri), stopword list(s), etc., that provide the context of a matching operation. Some of these options can be fully described as part of the query – e.g., the language – but others cannot – it’s not practical to reproduce a 300-word stopword list or a 10,000-term thesaurus as part of each query. That means we need a mechanism to describe some options indirectly – “I want to use the English maritime stopword list and the English pharmaceuticals thesaurus for this query.” Possibly, we also need a mechanism for defining those pointers (“When I say English maritime stopword list, I mean the following words …”). Probably, we need some mechanism for setting defaults, e.g., the default thesaurus to use, in the query prolog.

API vs. Solution

When looking at the progress of XQuery Full-Text, we should draw a clear distinction between a Full-Text API and a Full-Text solution. It is the API that should be standardized – a set of low-level functions and/or expressions that can be used as building blocks for real-world solutions. Any suggestions that higher-level functionality (such as the best way to search across enterprise documents) should be standardized are to be resisted.

13.2.6 XQuery Full-Text – Some Implementations

Let’s switch back to talking about XQuery Full-Text in general – the W3C XQuery Full-Text spec is still quite new and incomplete, and none of the major vendors (to our knowledge) has implemented this draft spec. There are some experimental implementations of W3C XQuery Full-Text, such as Galax – go to http://www.galaxquery.com/ for details. Some of the major software vendors have found other ways to include Full-Text search with their XQuery implementations.

IBM, Microsoft

Neither IBM nor Microsoft has, at the time of writing, announced plans to support W3C XQuery Full-Text. Both have been active on the task force (as can be seen from the list of editors’ affiliations), so they may well implement XQuery Full-Text when it becomes part of the standard.

Microsoft currently recommends using SQL Server’s Full-Text indexing capabilities to create a Full-Text index on XML documents, and doing Full-Text search to prefilter documents before applying XQuery.22 This approach is better than nothing, but it’s a long way from XQuery Full-Text – it only allows you to search across whole XML documents (minus element tags and attributes).

We haven’t managed to find any information on IBM’s plans in this area. IBM does offer Full-Text search in a number of its products, and they are active on both the XQuery Working Group and the XQuery Full-Text Task Force, so we look forward to hearing about their XQuery Full-Text product plans.

Oracle

Like Microsoft, Oracle also provides Full-Text search as part of its database. But the Oracle Text index, and its operators CONTAINS and SCORE, are XML-aware – it’s possible to search for a text query expression (words and phrases combined with Booleans, stemming, thesauri, fuzzy match, etc.) either WITHIN a specified element or attribute or in an XPath (INPATH). Oracle’s CONTAINS follows the spirit of the SQL/MM spec – that is, it’s a function that takes two arguments, a column name and a string containing a text query expression (describing what you are searching for). The text query expression language does not follow SQL/MM exactly, but it does provide an equally rich set of features.

Oracle’s XQuery implementation also includes a vendor-defined extension function, ora:contains, that provides Full-Text search as part of XQuery or XPath. The ora : contains function follows the “functions” approach described earlier in this chapter – that is, it’s a single function that takes in a node or item (what to search over) and a string representing the text query (what to search for) and returns a Boolean (matched or not matched).

These two approaches represent the classic “Who’s on top?” dilemma for Full-Text search over XML in a mixed environment.23 Is it better to express the query as “dog and cat within/movies/movie/title” – making Full-Text “on top” and restricting the results with an XPath? Or is it better to express the query as “/ movies/movie/title[. ora:contains “dog and cat”] “ – making the XPath “on top” and describing the Full-Text search on individual elements? With Oracle’s CONTAINS and ora:contains, you can do either. In fact, you can do both – and combining CONTAINS and ora:contains in the same query gives you a lot of the benefits of XQuery Full-Text without having to wait for a W3C XQuery Full-Text Recommendation. ora:contains can be used in any of Oracle’s SQL/XML or extension functions that use XPath, so, for example, you can locate a document using CONTAINS and then pull out the interesting parts of that document using extractNode ( ) with an XPath argument that uses ora : contains.24

In the latest version of Oracle’s database, Oracle has implemented a pre-Recommendation version of XQuery, callable via the SQL/XML extension functions XMLQUERY and XMLTABLE. The ora:contains XPath function can be used as part of an XQuery in either of those functions to provide XQuery Full-Text in a SQL context – see Example 13-8.25

Example 13-8   XQuery Full-Text, Oracle

image

image

Mark Logic

Mark Logic26 has implemented a variation on the “many-functions” approach described earlier in this chapter. The MarkLogic Content Server includes an XQuery implementation extended with a set of functions for Full-Text search. The MarkLogic Server supports a function cts:search($searchable-expression, $search-query). cts:search is similar to Oracle’s ora:contains, except that the second argument can be made up of other cts functions. There is one function to perform a simple word or phrase search (cts:word-query); one function for each of the combining operations (cts:and-query, cts:or-query, cts:and-not-query, cts:not-query); plus one function to describe proximity operations (cts:near-query). An optional argument to these functions allows you to specify matching options and word expansions (case sensitivity, punctuation sensitivity, stemming, wildcards, and language).

With this compromise – using arguments rather than functions to represent matching options and word expansions – the number of functions is kept manageable, but the language is fully integrated with XQuery. There is no string argument that represents an opaque sublanguage, so any of the query terms can be derived from XQuery expressions. One could argue that this is more composable than the grammar extension approach – in this many-functions approach, the matching options and word expansions are strings, which could presumably be derived from XQuery expressions, while in the grammar extension approach they are language keywords, which cannot.

Mark Logic claims their XQuery Full-Text functions map closely to the functionality in the W3C XQuery Full-Text working drafts, providing a lower-level API that can be used to implement the W3C XQuery Full-Text functionality when it becomes a Recommendation. See Example 13-9 for examples of W3C XQuery Full-Text queries and their equivalents using the Mark Logic language.

Example 13-9   XQuery Full-Text, W3C vs. Mark Logic

image

The Mark Logic language does have some differences in approach (as opposed to merely differences in surface syntax) from the W3C working draft. Most important, in the Mark Logic language, score is implicit – i.e., the cts:search function returns results implicitly ordered by score. This makes the queries shorter, at the cost of some control. And there are some detailed differences, representing shortcuts – e.g., if a query term is in all-lowercase, it has case-insensitive matching options by default; if it is in mixed-case or uppercase, it has case-sensitive matching options by default. These differences aside, Mark Logic’s language is an important step in the direction of XQuery Full-Text.

13.3 Update

The second major piece of functionality missing from XQuery 1.0 is the ability to update XML documents, collections, repositories, messages, etc. In this section, we discuss the reasons why updating XML is necessary and how it differs from updating records in ordinary flat files, rows in tables of relational databases, or objects in object-oriented databases.

We then review the requirements that the W3C’s XML Query Working Group has published for the planned XQuery Update Facility, examining each of the requirements in turn. Following that review, you will learn about some current proposals for XQuery Update Facility syntax and semantics and how they respond to those Requirements, after which we take a look at how a few XQuery vendors already deal with the problem of updating XML documents.

Finally, we speculate a bit on the future of the XQuery Update Facility, including a bit of “reading the tea leaves” to guess when a real standard will emerge.

13.3.1 Motivation: Where/Why We Need Update

For as long as computer systems have been used to store data (as opposed to merely manipulating them), it has been necessary to change stored data in many ways. As data have become more complex, more integral to our businesses and our day-to-day lives, the necessity to effect change to the data has only grown.

In the beginning (well, not the very beginning – only after the advent of magnetic media), data were usually stored directly on some storage medium, such as a tape, a drum, or a disk. Those media had different organizational mechanisms, which made the paradigms for accessing them very different – and the paradigms for changing their contents even more radically different. Of necessity, file systems were born; they provided a way of shielding almost all programs from the tedious details of the physical media on which data were stored while (normally) providing a kind of catalog to correlate a mnemonic name with a specific set of data.

But it didn’t take long for users of computer systems to realize that their needs for data organization were not always satisfied by these “flat” files (that is, files without inherent structure). The concept of databases caught on very quickly because of the tremendous additional power they gave applications for protection of, organization of, and access to data stored in them. Databases gave more structure to data, whether the model on which the database was designed was hierarchical, network, relational, object-oriented, or what have you. The presence of such structure makes it possible to identify, retrieve, and manipulate data at a more granular level than feasible with less structured data. And that granularity has everything to do with how all those data can be modified.

If the data in question comprise a long string of bits without clearly determinable internal boundaries (such as an executable image) and you want to make a change, it may be necessary to replace the entire string of bits – because it’s so difficult, if not impossible, to clearly identify the piece of the bit string that needs to be changed – instead of altering smaller parts of the string. Replacement of very large chunks of data is generally rather expensive, especially if only a tiny fraction of the data they contain is actually being altered.

When the data do have clearly identifiable boundaries, either because of some sort of “marker” carried in the data themselves (e.g., field marks and record marks) or because of an external description of the boundaries’ locations (structural metadata), then it becomes much easier to modify smaller pieces of those data. Consequently, the application changing the data becomes a bit more robust in the face of changes to the details of the data’s structure, maintenance requirements for that application are reduced, and the data modifications themselves are more easily verified.

XML is all about structure. The tree-structured nature of XML generally provides for a highly granular mechanism for representing data, whether those data are the semistructured data representing literature marked up for publication or the highly structured purchase orders for DVDs being sold at a video store. In fact, one of the primary values of XML is exactly that: expressing the structure needed to make data more useful.

It is thus quite tempting to view XML data oversimplistically, imagining that the task of modifying a portion of an XML document to be no more than a trivial replacement of that portion, perhaps just replacing the value of an attribute or inserting a new element and its children or deleting part of a tree. Unfortunately, the world isn’t quite that simple, in part because XML is used in enormously varied ways. XML documents might be stored in flat files, in rows of relational databases, or in native tree structures of so-called pure XML databases. But XML is also used to mark up highly transient information, such as weather or stock data being broadcast to subscribers’ cell phones.

What, then, does it mean to “update” XML data? It can mean many things: replacement of an entire XML document with a new one, deletion of an XML document, creation of a new XML document, modification of some part of an XML document by replacing one or more elements with new elements, insertion of new elements or attributes, removal of PI nodes, and so forth. Clearly, any truly useful update capability must be able to accommodate this wide variety of actions.

In the next section, as we explore the requirements that the W3C has published for an update facility to augment XQuery, the varied nature of updating will emerge.

13.3.2 Requirements

In early 2005, the W3C published a first public Working Draft of the Requirements for an XQuery Update Facility.27 These requirements are analogous to the Requirements28 published several years earlier for an XML Query Language. They specify, at a high level, the fundamental requirements for an update facility for XQuery without saying anything at all about how such a facility might be designed to satisfy those requirements. In this section, we’ll look at each of the requirements in turn, elaborating on those that are not obvious.

Usage Scenarios

The first thing found in the Requirements is a statement of the XML update usage scenarios that the Working Group believes it is necessary to address. (“Usage scenarios” are not the same thing as “use cases,” the latter being more detailed. No use cases have been published for the XQuery update facility at the time of this writing.) The scenarios are:

• Updating persistent XML stores – This involves modification of XML stored on a persistent medium, such as in a file or a database.

• Modifying XML messages – In this scenario, messages could be altered either to add information generated as they are being processed or to update their statuses.

• Add to existing XML document – New data can be added to XML documents, such as appending new entries to a data log.

• Updating XML registries – Configuration files, user profiles, administrative logs, etc. can require updates.

• Creating edited copies – A new copy of an existing XML document or subtree in a document, such as a modified web page, can be created.

• Modifying XML views – It is possible to create an XML view of non-XML data,29 leading to the desirability of updating those non-XML data through the XML view of it.

General Requirements

The Requirements include a number of items that don’t fit neatly into any of the other categories discussed in this section, but that are seen as overall requirements for the XQuery Update Facility. They are:

• Query Update Syntax – Just as XQuery is defined using a human-readable syntax and an XML syntax (XQueryX), the Update Facility should have both kinds of syntax. Among other things, this will ease the process of integrating the Update Facility into XQuery.

• Declarativity – XQuery is a declarative language, and its Update Facility should also be declarative in nature. One of the advantages of this is that many optimization strategies will be possible.

• Protocol Independence – XQuery was designed to have no dependency on any sort of networking protocol, and the Update Facility will have the same characteristic.

• Error Conditions – When errors occur during the process of performing an update, the error that is reported will ordinarily be one of the standard errors defined by the Update Facility, using the same conventions that XQuery uses.

• Static Type Checking – XQuery has an optional static typing feature, and the Update Facility will provide a parallel feature.

Relationship to XQuery 1.0

The XQuery Update Facility’s relationship to XQuery itself has implications on the design of the Update Facility.

• Based on the Data Model – XQuery operates not on the serialized representation of XML documents, but on the Data Model representation. The Update Facility must use the same paradigm in order to integrate with XQuery.

• Based on XQuery – The Update Facility must identify items to be updated or deleted and to be used as reference points for insertions. It will use XQuery to identify those items as well as to construct new values for updates and insertions.

XML Query Update Functionality

This set of requirements identifies the specific capabilities that the XQuery Update Facility must have in order to be useful.

• Locus of Modifications – The Update Facility will be capable of altering properties of existing nodes in a Data Model instance without changing the identity of those nodes, as well as of making copies (with new identities) of nodes.

• Delete – The Update Facility will be able to delete nodes from Data Model instances.

• Insert – The Update Facility will be able to insert new nodes at specified positions in Data Model instances.

• Replace – It will be possible to replace nodes in Data Model instances.

• Changing Values – It will also be possible to alter the values of nodes (that is, the value that the typed-value accessor defined by the Data Model would return) in Data Model instances.

• Modifying Properties – It may be possible to modify a number of properties of nodes in Data Model instances, such as the name of the node, the type of the node, the nilability of the node, and so forth. The specific properties that can be changed will vary between kinds of nodes (e.g., changing the name of a comment node is not a meaningful concept).

• Moving Nodes – The Update Facility might provide a way to move a node from one place in a Data Model instance to another place. Problems such as node identity may make it infeasible to implement this functionality.

• Conditional Updates – The Update Facility will provide a way to specify that certain operations are performed only when certain conditions are met. This might be done, for example, through the use of some sort of “if” expression or statement, or a “case” expression or statement.

• Iterative Updates – It will be possible to specify operations that are invoked on all nodes in a sequence of nodes generated by some sort of iterative expression (one example of such an expression in XQuery is the for expression in a FLWOR expression).

• Validation – XQuery allows for the results of expressions to be validated against the schema definitions that apply to them; such validation makes a copy of those results and thus changes the identities of the items within the results. The Update Facility might provide a way to validate the results of one or more changes without changing the identity of the nodes involved.

• Compositionality – It will be possible to compose the Update Facility operations with one another, and it might be possible to compose such operations with XQuery expressions (for example, allowing an update operation to be specified everywhere that XQuery expressions are permitted).

• Parameterization – The Update Facility might provide a way to parameterize the various update operations that it defines.

Transaction Characteristics

Whenever multiple changes are made to data, there is a need to ensure that the end result is consistent with the intent of those changes. Most data management systems provide some sort of transaction capability to provide that consistency. A popular and widely respected book30 uses the term ACID to describe the properties of transactions.

• Atomicity – The Update Facility will specify some number of atomic (indivisible) operations and provide a way to group operations together so that the entire group is treated as atomic.

• Consistency – After an update has been performed, the Data Model instance that was modified will be consistent with respect to the Data Model constraints, and all type annotations will be consistent with the types of the items that they annotate. The Update Facility might also provide a way to force this level of consistency at a finer granularity, such as an individual insertion.

• Isolation – The Update Facility might define a way by which the operations that it defines can be executed in isolation from any changes made to the data from any other activity.

• Durability – The Update Facility will specify the manner in which the changes made to Data Model instances are caused to be durable. “Durable” is not the same as “persistent” – that is, changes to a Data Model instance are durable if the Data Model retains the changes until their effects are altered by additional update operations or until the Data Model instance ceases to exist for any reason.

13.3.3 Alternatives: Syntax and Semantics

There are any number of decisions that have to be made when defining an update capability for any query language. When the query language in question is a functional language without side effects, there are additional decisions to be made. XQuery is such a functional language and has been carefully designed to have no side effects (other than those that might be caused by the invocation of external functions).

But the very nature of an update capability is that side effects are intended to occur. This apparent conflict can be approached in several ways. One way is simply to abandon the notion of XQuery as a side-effect free language – we think it is extremely unlikely that the XML Query WG in the W3C would entertain such a notion.

Another way is to define a second language – let’s call it the XQuery Update Language – that invokes XQuery expressions or even incorporates XQuery as a proper subset. The disadvantage of following this path is that implementers and users alike are faced with having to deal with two (apparent) languages. We doubt that the XML Query WG would choose this approach because of the possible confusion it might cause.

A third approach is to isolate the components that can cause side effects (such as insertion, deletion, and replacement) into something entirely new in XQuery: statements. As you read in Chapter 11, “XQuery 1.0 Definition,” XQuery comprises only expressions that are contained in modules. If the concept of XQuery statements were added to the mix, then it might be acceptable to partition the language into statements that are permitted to cause side effects and expressions that are not.

A fourth alternative is to define two kinds of expression: those that are free of side effects and those that may cause side effects. The two kinds would be clearly delineated by, for example, using a particular keyword (e.g., update) to introduce every expression that might cause a side effect.

We believe that the real choice is between those last two alternatives. In the short term (XQuery 1.0 time frame), it’s almost certain that the Update Facility will be published as a separate W3C Recommendation. Why? Because the XML querying community urgently needs the Update Facility and is unlikely to wait for the next edition of XQuery to be published before adopting a solution, and XQuery 1.0 is highly unlikely to be delayed until the Update Facility definition can be completed. However, we think it’s probable that the Update Facility will be more properly integrated into some future edition of XQuery itself.

The choice of statement vs. expression will have significant impacts on the syntax that the Update Facility will define; it will also have implications on the semantics of the language. For example, if the approach of defining update statements is taken, it seems quite unlikely that the semantics of a particular update statement will impact the semantics of any expressions used by the statement. But if updates are defined as kinds of expressions, the ability to compose all kinds of expressions together makes it likely that the semantics of update expressions will impact the semantics of expressions with which they are composed.

Questions that must be answered include the following:

• Should there be a concept of an XQuery “program,” analogous to XQuery modules, in which update statements are allowed?

• Should XQuery functions be allowed to contain update statements or expressions? If so, should there be explicit syntax to mark, in XQuery library modules, the signatures of such functions as having side effects?

• Should there be explicit syntax for starting and terminating a sequence of update operations that are intended to function as an atomic unit (such as a SQL’s START TRANSACTION and COMMIT/ROLLBACK statements that start and terminate transactions), or should that be done by some other facility (such as a transaction manager external to the XQuery implementation)?

• What implications are caused by update expressions or statements on the declarative nature of XQuery as a whole? What kinds of rules might be required to limit the ability of implementations to “reorder” the evaluation of expressions based on the presence of update operations?

The nature and extent of the questions that must be addressed when designing an update capability are sufficiently difficult that the design is not trivial and will be very time-consuming.

Proposals that we have seen for the XQuery Update Facility have included both the “update statement” and “update expression” approaches. They generally propose statements or expressions to insert new XML values both as children of existing elements (as first sibling or as last sibling of existing children) and as the preceding or following sibling of existing nodes. In addition, they propose facilities to delete existing XML nodes and subtrees, to rename existing nodes without changing their identities, to alter the contents of existing nodes (again, without changing their identities), and so forth.

None of the proposals we’ve seen have addressed the issue of transactions. That may be wise at this stage, because the subject is known to be controversial, with some participants believing that explicit syntax should be provided for terminating (as well as, perhaps, initiating) transactions, while others believe that such matters are better left to the environment in which XQuery and the Update Facility exist.

We believe that the XQuery Update Facility was delayed relative to XQuery 1.0 itself in order to allow the Working Group to focus on the base expression language without the additional distractions and workload that concurrent work on an update capability would have caused. Now that XQuery is closing in on the final Recommendation stage, we expect that the XQuery Update Facility will receive more attention.

Unfortunately, many XQuery applications don’t have the luxury of waiting until the W3C publishes a final Recommendation for an XQuery Update Facility and then for implementations of the Facility to be delivered. Instead, those applications have little choice other than electing to use the proprietary features of current XQuery implementations and adopting the eventual Recommendation when it has been delivered and implemented.

13.3.4 How Products Handle Update Today

In the absence of even a public Working Draft for an XQuery Update Facility provided by the W3C and the existence of marketplace requirements for the ability to update XML documents, collections, messages, etc., suppliers of several XQuery implementations have provided update solutions using several strategies.

In this section, we briefly explore the approaches taken by a few of the more prominent implementers of XQuery processors as well as an approach being proposed by a group that develops specifications related to XML databases.

Oracle

Oracle, like the other implementers discussed in this chapter, does not support an XQuery syntax-based approach to updating XML documents. Instead, Oracle provides three mechanisms for responding to users’ needs in this area. Applications must choose between the three strategies and cannot, for example, locate and retrieve data using one technique and update it using another.

One approach is a DOM-based API (part of Oracle’s XMLDB), which requires programs to invoke methods on a DOM object representing an XML document. That API is based on the W3C’s Document Object Model,31 which provides methods for traversing the DOM representation of an XML document, retrieving values from desired nodes, and updating the document by inserting nodes, deleting nodes, modifying the values of nodes, and so forth.

A second approach is a Java API that defines a class to represent the XML type, along with methods such as deleteXML(), insertXML( ), and updateXML( ). The Java API is heavily used by Oracle’s customer base, and it is likely to be augmented by the emerging XQJ (XQuery API for Java) API32 under development in the Java Community Process. You’ll read more about XQJ in Chapter 14, “XQuery APIs.”

The other approach, more directly related to XQuery, is based on SQL/XML (about which you will read in Chapter 15, “SQL/XML”). In SQL/XML, applications use ordinary SQL statements to access XML data stored as values of the XML type in columns of database tables. A SQL function, XMLQuery ( ), allows applications to evaluate XQuery expressions against those XML values. Oracle extensions33 to SQL/XML provide the ability to update those XQuery expressions. The extension functions are:

• UpdateXML ( ) – Replaces an existing node (or node sequence) in an XML value (cannot be used to insert or remove nodes).

• InsertChildXML ( ) – Inserts a node as a child of an existing node in an XML value, using an associated XML Schema to determine the correct insertion location among the new child node’s siblings.

• InsertXMLBefore( ) – Inserts a node as the immediately preceding sibling of an existing node of an XML value.

• AppendChildXML ( ) – Inserts a node as a child of an existing node in an XML value as the very last sibling of the existing node’s current children.

• DeleteXML() – Deletes an existing node from an XML value.

Unfortunately, Oracle’s extension functions are not true update functions, in that they do not update an XML value “in place.” Instead, they are really transformation functions that return an updated copy of the value on which they operate. Therefore, they are side effect-free functions. When used in a context (such as a SQL UPDATE statement) where values are being replaced, then these functions have the semantics of update functions.

Similar functions are provided by Oracle in their XML SQL Utility (SMU), which is part of their PL/SQL language.

We would anticipate that, once an XQuery Update Facility has been defined in the W3C, Oracle would update their XQuery implementation to support the Update Facility, in addition to the other strategies they already provide for updating XML.

IBM

At the time of writing, IBM had not yet released an XQuery implementation. Naturally, they have not released an XQuery update capability either.

IBM has, however, offered a tool that “provides an XML view of relational tables and a query of those views as if they were XML documents.” This tool, named XML for Tables, is available34 on IBM’s alpha Works site. The tool, which implements a subset of XQuery, translates XQuery expressions into SQL expressions that are executed by DB2. The tool does not, however, appear to have any ability to update the data behind those XML views.

IBM has frequently indicated that a future version of DB2 will support XQuery directly. Presumably that support will include XQuery updating capabilities at the same time. We presume that IBM will add support for the XQuery Update Facility when that spec has been finalized.

Microsoft

Microsoft’s SQL Server 2005 provides an implementation of (a subset of) XQuery. As part of that implementation, they have built some extensions that support XML updates. Unlike Oracle, these extensions do not depend on SQL/XML capabilities, but are provided through a method operating on instances of their XML type.

That method, modify ( ), is invoked as part of the SET clause of a SQL UPDATE statement. The parameter of the method is a character string that expresses one of several types of updates, including insert, replace value of, delete, etc., as well as an XQuery expression that identifies the nodes to be modified and the values to be used for the modifications.

Microsoft provides a set of .NET classes used with XML. Some of these classes provide a wide variety of methods that support setting the values of nodes, inserting nodes into specified locations, deleting nodes, and replacing nodes. These classes and their methods are unrelated to XQuery, but they serve roughly the same function as the classes and methods of Oracle’s DOM-based API.

Microsoft offers yet another approach to updating XML, based on a facility they call updategrams. An updategram “represent[s] changes to an XML instance that, when combined with an annotated schema, persist these changes back to relational changes.”35 Updategrams are defined to be XML documents using a specific vocabulary that captures updates to the (relational) data that underlies an XML view.

The vocabulary uses elements <before> and <after> as children of a <sync> element. Each <sync> element represents a single update to the XML view, with the <before> and <after> elements providing the values of the “existing” state of the data being updated and the new value resulting from the update, respectively. If a particular <sync> element contains a <before> element and not an <after> element, it represents a deletion of data; the presence of an <after> and not a <before> element indicates an insertion, and the presence of both corresponds to a modification of existing data. (Some observers suggest that an updategram is, in effect, isomorphic to a change log that records actions taken to update such data.)

As the XQuery Update Facility reaches its final development stages, we expect that Microsoft will enhance SQL Server’s XQuery support to align with the Update Facility specification.

Mark Logic

The MarkLogic Content Server includes an XQuery implementation extended with a set of functions for update. The functions are:

• document-insert($uri, $root-node)

• document-delete ($uri)

• node-replace($old-node, $new-node)

• node-delete($old-node)

• node-insert-before($node, $new-sibling-node)

• node-insert-after($node, $new-sibling-node)

• node-insert-child ($parent-node, $new-child-node)

The MarkLogic Server evaluates an XQuery module by spawning an evaluation thread that operates on a snapshot of the database. All the updates appearing in the module are collected, and a change vector (i.e., a time-ordered list of changes) is computed and journaled. The XQuery module has an implied commit that guarantees that either all the changes or none of the changes become persistent.

Mark Logic update evaluation guarantees atomicity, isolation, and durability. Consistency is satisfied, but only in a relatively weak sense, since integrity constraints are not explicitly part of the Mark Logic architecture. Integrity is expressed via denormalized documents: All the elements collected in one document are maintained as a consistent collection. Isolation is maintained in the strongest sense: Each query evaluation thread operates on a static snapshot, and all the updates refer to the original values, even if an update subexpression modifies one of the values appearing elsewhere in the XQuery module

In addition, the MarkLogic Server implements the WebDAV protocol and therefore a complete document properties infrastructure. Updates to document properties are expressed using the following functions:

• document-set-properties($uri, $prop-list)

• document-add-properties($uri, $prop-list)

• document-remove-properties($uri, $prop-name-list)

• document-set-property($uri, $prop)

XUpdate

We’ve seen how four data management vendors are dealing with XML update requirements today. It’s also worth mentioning another sort of standards activity in the open source community: XML:DB is “an industry initiative … [that] provides a community for collaborative development of specifications for XML databases and data manipulation technologies.” It also provides open source reference implementations of their specifications.

Among the specifications being developed by XML:DB is one named “XUpdate.” (Perhaps “has been developed” would be more appropriate, as the latest draft’s date would suggest.) The most recent draft of the XUpdate specification36 is dated 2000-09-14, but there continues to be a certain amount of industry buzz about XUpdate and its applicability to the strong requirement for a language to update XML documents.

XUpdate instructions are in the form of an XML document that contains elements in the xupdate namespace (http://www.xmldb.org/xupdate). XUpdate uses XPath 1.0 as the expression language used to select XML nodes for further processing by the instructions embedded in those elements. Each update must be enclosed in an <xupdate:modifications> element. The update elements themselves are:

• <xupdate:insert-before>

• <xupdate:insert-after>

• <xupdate:append>

• <xupdate:update>

• <xupdate:remove>

• <xupdate:rename>

• <xupdate:variable>

• <xupdate:value-of>

• <xupdate:if>

Most of these update elements correspond very closely to the kinds of updates supported by Oracle’s and Microsoft’s approaches and to the W3C’s requirements for an XQuery Update Facility. In fact, one might argue that XUpdate’s syntax is reasonably analogous to Microsoft’s updategrams.

<xupdate : append>, <xupdate : insert-after>, and <xupdate : insert-before> may have child elements: <xupdate : element>, <xupdate : attribute>, <xupdate : text>, <xupdate : comment>, and <xupdate : processing-instruction>.

<xupdate:update> has a select attribute containing an XPath expression that identifies the node set to be updated. The content of the <xupdate:update> element supplies the new content of the elements in the identified node set.

<xupdate:remove> has a select attribute containing an XPath expression that identifies the nodes to be removed.

<xupdate:rename> has a select attribute containing an XPath expression that identifies a set of element or attribute nodes (the other node types are not allowed). The content of the <xupdate:rename> element provides the new name of the identified nodes.

<xupdate:variable> has two attributes: name (which specifies the name of the variable being created) and select (which specifies the value assigned to that variable). The value of a variable is obtained through the use of the <xupdate:value-of> element, whose select attribute provides the name of the variable whose value is wanted. <xupdate:value-of> can be used in other situations as a child element, where it provides a value to be used in evaluating that child element. The syntax and semantics of these elements are similar to those of the analogous variable declaration facility in XSLT (<xsl:variable> and <xupdate :value-of>). <xupdate:variable> and <xupdate:value-of> have no obvious correlation to anything in the update facilities of the vendors we’ve surveyed in this section.

The XUpdate specification has a heading for Conditional Processing, but there is no content under that heading. However, the XUpdate DTD indicates that the <xupdate:if> element has an attribute named test, whose content must be a Boolean expression, that determines whether or not the contents of the <xupdate:if> are evaluated. The contents of the element can be any mixture of text and child elements <xupdate:element>, <xupdate:attribute>, <xupdate:text>, <xupdate:processing-instruction>, and <xupdate:comment>.

A Google search for XUpdate turns up over 30,000 references, many of which describe implementations of the incomplete five-year-old specification. While XUpdate contains some very useful ideas and has received some industry attention, we don’t believe that it is a long-term solution to the problem of integrating XML updating facilities into XQuery. The primary difficulty is that XUpdate uses an XML syntax instead of XQuery’s more user-friendly syntax. Another, less critical, reason is that the specification is incomplete in several ways: The syntax and semantics of some elements are not specified at all, and the descriptions of the other elements are imprecise and can be interpreted by different implementers in different ways.

13.3.5 What Lies Ahead?

We’ve dusted off the old crystal ball and found that it’s got too many cracks to be easily read. But we’ll go out on a limb to suggest that the XQuery Update Facility will be fully specified not too long (six months or so) after this book goes to press and that it will be embraced fairly quickly by a good cross section of XQuery implementations. It’s too soon to tell whether the specification will provide update statements or update expressions – but it will surely not contain both! Our personal experience leads us to prefer the approach of using update expressions, because we think it will be more flexible and powerful in the long run and can be specified more elegantly (perhaps more simply) than the update statement approach. Time – and the XML Query Working Group – will tell.

13.4 Chapter Summary

In this chapter, we’ve discussed two of the most important features of an XML query language that are missing from XQuery 1.0: a Full-Text search capability and an Update Facility.

We’ve given you a summary of the issues involved in the two facilities and laid out the current state of their development in the XML Query and XSL Working Groups. We’ve also shown how some implementations are already providing these capabilities to their users.

Both of these are already under consideration in the W3C and may be expected to hit the streets some time in 2006.


1XML Query (XQuery) Requirements, W3C Working Draft (Cambridge, MA: World Wide Web Consortium, 2003). Available at: http://www.w3.org/TR/2003/WD-xquery-requirements-20031112.

2Sometimes you do need case-sensitive Full-Text search. For example, you may want to distinguish between an occurrence of a word in general use and an occurrence of the same word used as somebody’s name – you may want to search for the word “melt”, but not find the person’s name “Melt.” This is tricky, because a case-sensitive search for “melt” will miss any occurrence of “melt” at the beginning of a sentence. Perhaps more useful is a search that is case-insensitive unless all letters in the word are uppercase – with that rule, you can distinguish between the word “dare” and the acronym “DARE” in your query.

3The word “mouse” is interesting in a couple of respects. First, it’s one of those words whose plural form is not just the singular form with an “s” on the end. Any simple substring search for “dog” will match things that contain “dogs,” but will not match “mice” if your query term is “mouse.” Second, “mouse” arguably has two plural forms – Steven Pinker argues that the plural of “computer mouse” is “computer mouses” (Steven Pinker, The Language Instinct : How the Mind Creates Language [New York: Morrow Publishing, 1994]).

4OCR (optical character recognition) is technology for converting a hard copy of a document into a soft copy – typically the document is scanned into an image, and then a software program “looks” at the image to convert it into characters.

5Gerard Salton, Automatic Text Processing (Reading, MA: Addison-Wesley, 1989). A simple overview of the Salton algorithm is available on the web at: http://www.oracle.com/oramag/oracle/01-mar/o21int.html#SCORE (Douglas Scherer and Carol Brennan, Exploring Oracle Text Basics, Oracle Magazine [March 2001]).

6One of the authors is the chairman of the W3C XQuery Full-Text Task Force, the other is a founder/member of the Task Force and editor of some of the specs.

7XQuery and XPath Full-Text Requirements, W3C Working Draft (Cambridge, MA: World Wide Web Consortium, 2003). Latest version available at: http://www.w3.org/TR/xquery-full-text-requirements/.

8XQuery 1.0 and XPath 2.0 Full-Text Use Cases, W3C Working Draft (Cambridge, MA: World Wide Web Consortium, 2005). Latest version available at: httpavailable at: http://www.w3.org/TR/xmlquery-full-text-use-cases/.

9XQuery 1.0 and XPath 2.0 Full-Text, W3C Working Draft (Cambridge, MA: World Wide Web Consortium, 2005). Latest version available at: http://www.w3.org/TR/xquery-full-text/.

10ISO/IEC 13249-2:2000, Information Technology – Database Languages – SQL Multimedia and Application Packages – Part 2: Full-Text (Geneva, Switzerland: International Organization for Standardization, 2000).

11This example was adapted from an example in Jim Melton and Andrew Eisenberg, SQL multimedia and application packages (SQL/MM), SIGMOD Record, Vol. 30, Issue 4: 97-102 (New York: Association for Computing Machinery, 2001). Available at: http://portal.acm.org/citation.cfm?id=604264.604280.

12There is an issue here. We presented mf-expand-words as a real, ordinary XQuery function. In the case of stemming (with some means of identifying the language), this could be implemented as a real function returning a finite list of words. However, in the case of some of the match options, it would be impractical to implement the function naïvely. For example, the result of mf-expand-words (“a*b,” “wildcards”) is, in theory, an infinitely long list of words. In practice, it is usually finite (bounded by the set of words in the text index). Even when there is no convenient bounding mechanism, it is possible to execute such a function in the context of some particular query, even if it’s not possible to return a result for the function when it is used stand-alone. That said, some people are uncomfortable defining a language that depends on such special functions.

13Grammar rules are quoted from the XQuery Full-Text Working Draft and the XQuery 1.0 Working Draft, both of November 3, 2005, available at: http://www.w3.org/TR/2005/WD-xquery-full-text-20051103/ and http://www.w3.org/TR/2005/WD-xquery-20051103/, respectively. When you read this, the numbering of the rules (and of course the rules themselves) might be different.

14All of these units (words, sentences, paragraphs) are implementation-defined.

15There is a subtle difference between “diacritics insensitive” and “without diacritics.” “diacritics insensitive” tells the query processor to ignore diacritics when computing a match, so the occurrence of the query term in the text, with or without any diacritics, will result in a match. “‘resumé’ diacritics insensitive” matches “résumé,” “resume,” “résume,” “resme,” etc. On the other hand, “without diacritics” says that a term only matches text that does not contain diacritics, whether or not the query term contains diacritics. “‘résumé’ without diacritics” matches only “resume.” Similarly, “with diacritics” only matches text that does contain some diacritics, even though the query term may not contain diacritics. So “‘resume’ with diacritics” matches “résumé,” “resumé,” and “resme,” but does not match “resume” (in case the user knows there are diacritics in the text to be matched, but doesn’t have a diacritic-capable input screen).

16ISO 2788:1986, Documentation Guidelines for the Establishment and Development of Monolingual Thesauri (Geneva, Switzerland: International Organization for Standardization, 1986).

17The W3C XQuery home page is at http://www.w3.org/XML/Query. At the time of writing, the latest XQuery Full-Text test page is at http://www.w3.org/2005/04/qt-applets/xquery-fulltextApplet.html.

18The XQuery Full-Text Language and Semantics spec does not say which description is normative, but we suspect that the formal semantics description is intended to be the normative definition for operations that are formally defined there, rather than the description of behavior in the language sections.

19S. Amer-Yahia, C. Botev, J. Shanmugasundaram. TeXQuery: A Full-Text Search Extension to XQuery (2004). Available at: http://www.cs.cornell.edu/database/TeXQuery/.

20The definition of FTContains is precise where it can be, but many parts of FTContains are left implementation-defined or implementation-dependent (see the next subsection, “How Standard?”).

21Comparing scores from Full-Text searches across different data sources – even if the searches are done by the same engine using the same scoring algorithm – is famously difficult.

22S. Pal, M. Fussell, and I. Dolobowsky, XML Support in SQL Server 2005 (Redmond, WA: Microsoft Press, 2004). Available at: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/sql2k5xml.asp.

23By “mixed environment” we mean one where there is a set of capabilities for structured search and a separate set of capabilities (e.g., a set of functions) for Full-Text search. In the W3C XQuery Full-Text grammar, the two sets of capabilities are integrated enough that you can mix’n+match them however you want – they are fully composable.

24For an example, see Searching for Content and Structure in XML Documents, Oracle9i Database Daily Feature. Available at: http://www.oracle.com/technology/products/oracle9i/daily/nov30.html. See also Oracle XML DB Developer’s Guide 10g Release 1 (10.1), Chapter 9: Full Text Search Over XML. Available at: http://download-west.oracle.com/docs/cd/B14117_01/appdev.101/bl0790/xdb09sea.htm#sthref885.

25Stephen Buxton, Querying XML, XML 2004 Proceedings (IDEAlliance, 2004). Available at: http://www.idealliance.org/proceedings/xml04/abstracts/paper218.html. This example was taken from Stephen Buxton, Muralidhar Krishnaprasad, and Zhen Hua Liu, Querying XML (Oracle Openworld 2004). Available at: http://download-west.oracle.com/oowsf2004/1136_wp.pdf.

26See http://www.marklogic.com.

27XQuery Update Facility Requirements, W3C Working Draft (Cambridge, MA: World Wide Web Consortium, 2005). Available at: http://www.w3.org/TR/xquery-update-requirements/.

28XML Query Requirements, W3C Working Draft (Cambridge, MA: World Wide Web Consortium, 1999). Available at: http://www.w3.org/XML/Group/1999/11/xmlqueryreq-19991104/. (The most recent version is available at: http://www.w3.org/TR/xquery-update-requirements/.)

29ISO/IEC 9075-14:2003, Information Technology – Database Languages – SQL – Part 14: XML-Related Facilities (SQL/XML) (Geneva, Switzerland: International Organization for Standardization, 2003).

30Jim Gray and Andreas Reuter, Transaction Processing Concepts and Techniques (San Mateo: Morgan-Kaufmann, 1994).

31Document Object Model (DOM) Level 3 Core Specification, W3C Recommendation (Cambridge, MA: World Wide Web Consortium, 2004). Available at: http://www.w3.org/TR/DOM-Level3-Core/.

32Information about JSR 225: XQuery API for Java (XQJ) can be found at http://www.jcp.org/en/jsr/detail?id=225.

33Oracle XML Java API Reference 10g Release 1 (10.1), XDK Java Classes (Redwood Shores, CA: Oracle Corporation, 2004). Available at: http://download-west.oracle.com/docs/cd/B14117_01/appdev.101/M2024/toc.htm.

34http://www.alphaworks.ibm.com/tech/xtable.

35S. Pal, M. Fussell, and I. Dolobowsky, XML Support in SQL Server 2005 (Redmond, WA: Microsoft Press, 2004). Available at: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/sql2k5xml.asp.

36A. Laux and L. Martin, XUpdate WD (2000). Available at: http://xm.ldb-org.sourceforge.net/xupdate/xupdate-wd.html.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.226.177.86