Chapter 3

Retrieving Data from XML Documents

IN THIS CHAPTER

Bullet Discovering XQuery

Bullet Finding out about FLWOR expressions

Bullet Comparing XQuery with SQL

Early in the development of XML, one of the primary concerns was the conversion of data stored in SQL databases to XML so that it could be transmitted to other, incompatible data stores or so that it could be displayed on the web. In Book 6, Chapter 1, I describe several SQL functions whose purposes are to perform such conversions. Converting XML to SQL is also an important endeavor because SQL has traditionally been the premier tool for extracting the information you want from a collection of data. This conversion usually takes the form of shredding, in which an XML document is torn apart and pieces of it flow into the columns of tables in an SQL database. Then queries can be made by using normal SQL SELECT statements.

Querying XML documents directly, without shredding them into an SQL database, is much more complicated; thus, that capability took a while longer to implement in standard form. In addition, many common examples of XML documents aren’t readily shredded. The tree structure of an XML document can be difficult to translate into the row-and-column structure of a relational database. Consequently, several years of development of XML query facilities have been required to produce the XQuery standard, currently at version 3.1, which defines how to query an XML document directly and retrieve the information you want.

XQuery, like XML itself, is a vast topic, which I don’t cover in detail here. I describe it briefly, however, and give some examples of its use. These examples are in no way comprehensive but should give you an idea of what you can do with XQuery. For an in-depth treatment, I recommend Querying XML, by Jim Melton and Stephen Buxton (published by Morgan Kaufmann Publishers).

XQuery

XQuery is a nonprocedural language specifically designed to retrieve desired information from XML documents, just as SQL is a nonprocedural language specifically designed to retrieve desired information from relational databases. Whereas relational databases are highly structured, XML documents can be characterized as semistructured. What an XML document looks like varies a lot more than what a relational database looks like.

Remember When I say that SQL or XQuery is a nonprocedural language, I mean that a query — whether expressed in SQL or XQuery — describes what to do but not how to do it. The how is left up to the engines that process the SQL or XQuery code.

Where XQuery came from

XQuery is the result of combining the best parts of several predecessor languages. For a long time, it’s been clear that being able to query XML documents directly has great value. Several groups worked on the problem and came up with a variety of query languages. One of those languages is XQL, written in 1998 by Jonathan Robie, who worked for Software AG at the time. An unrelated language, also named XQL, was created at Fujitsu Labs at the same time, but never developed beyond the prototype stage.

At about the same time, a language named XML-QL emerged from a collaboration of several researchers. Stanford University joined the game with a project named Lore and a language named Lorel, which had object-oriented characteristics. At Institut National de Recherche en Informatique et en Automatique (INRIA), the French National Institute for Research in Computer Science and Control, a research language named YATL was developed. Rounding out the predecessors, a language named Quilt was developed by Don Chamberlin (one of the authors of SQL), Jonathan Robie, and Daniela Florescu, all of IBM.

Although XQuery probably owes more to Quilt than to any of the other predecessor languages, it takes ideas from all the others and benefits from all that has gone before.

What XQuery requires

The XQuery 3.1 language specification defines what must be true of a language for it to qualify as an XQuery implementation. Actually, it defines three levels. Some things must be true, others should be true, and some other things may be true. XQuery is evolving, and it’s not yet clear what will be mandatory in the future.

Among the things that the XQuery 3.1 language specification defines are the following requirements:

  • XQuery is a declarative language, and as such, it must not mandate an evaluation strategy. It describes what the processor should do, not how it should do it.
  • XQuery may have more than one syntax binding, but it must have one syntax that’s convenient for humans to read and one syntax expressed in XML that reflects the underlying structure of the query.
  • XQuery must define standard error conditions that can occur during execution of a query.

XQuery 1.0 doesn’t have any update capability, but an extension named XQuery Update 1.0 does. However, at the time of this writing, not all vendors support XQuery Update 1.0. Oracle and Microsoft, among others, offer proprietary update solutions (discussed in Book 6, Chapter 2).

XQuery functionality

The XQuery requirements document specifies several things that an XQuery implementation must do, as well as things that it should do and things that it may do. Here are some of those functionality requirements:

  • XQuery must support operations on all data types in the XQuery data model.
  • Queries must be able to express simple conditions on text, including on text that spans element boundaries. Element boundaries are specified by tags such as <element></element>. Text may go beyond such a boundary. If it does, queries must be able to express simple conditions on that text.
  • Operations on collections must include support for universal and existential quantifiers. “All” is an example of a universal quantifier and “some” is an example of an existential quantifier. When Aristotle said, “All men are mortal,” he was making a statement about all men, a universal statement. When he said, “Some men are Greeks,” he was making an existential statement. A least one man is a Greek, but one cannot conclude from the statement that all men are Greeks.
  • XQuery must be able to combine related information from different parts of a given document or from multiple documents.
  • XQuery must be able to compute summary information from a group of related document elements (aggregation).
  • XQuery must be able to sort query results.
  • XQuery must support NULL values.
  • Queries should be able to operate on literal data. Literal data is the data itself, rather than some representation of the data.
  • Queries must be able to perform simple operations on names, such as testing for equality in element names and attribute names, and processing instruction targets. Queries may perform more powerful operations on names.
  • XQuery should support the use of externally defined functions on all data types of the XQuery data model.
  • XQuery must be able to provide access to environmental information, such as current date, time, and time zone.

These requirements are a partial list. XQuery 3.1 meets all the requirements I have listed as must or should. Other requirements in the XQuery Requirements may or may not have been met by XQuery 3.1.

Usage scenarios

The World Wide Web Consortium (www.w3.org) has developed a set of 77 use cases that cover 9 categories of queries. In each case, a query is applied to the supplied input data, and the expected results are given. You can use these use cases as a starting point in testing an XQuery implementation to see whether it’s more or less working. An exhaustive test suite, which tests every possibility, would take thousands of such cases, but these 77 are a good start. In this section, I show you just one such case.

Because this book just happens to be about SQL, one category of data that you may want to query using XQuery is data stored in a relational database. You can find such a case at paragraph 1.4.1 of the XML Query Use Cases document, which is available at www.w3.org/TR/xquery-use-cases. I reproduce that case in this section.

The case takes data from a simplified version of an online auction. Three tables are involved: USERS, ITEMS, and BIDS. The USERS table contains information on buyers and sellers. The ITEMS table lists items currently for sale or that have recently been for sale. The BIDS table contains all the bids on record.

Here are the tables and the columns that they contain:

USERS (<b>USERID</b>, NAME, RATING)

ITEMS (<b>ITEMNO</b>, DESCRIPTION, OFFERED_BY, START_DATE, END_DATE,

RESERVE_PRICE)

BIDS (<b>USERID</b>, <b>ITEMNO</b>, BID, BID_DATE)

USERID is the primary key of the USERS table; ITEMNO is the primary key of the ITEMS table, and the combination of USERID and ITEMID is the composite primary key of the BIDS table.

The relational database tables correspond to input documents named users.xml, items.xml, and bids.xml. The correspondence between the tables and the XML documents is specified by the following Document Type Definition (DTD):

<!DOCTYPE users [

<!ELEMENT users (user_tuple*)>

<!ELEMENT user_tuple (userid, name, rating?)>

<!ELEMENT userid (#PCDATA)>

<!ELEMENT name (#PCDATA)>

<!ELEMENT rating (#PCDATA)> ]>

<!DOCTYPE items [

<!ELEMENT items (item_tuple*)>

<!ELEMENT item_tuple (itemno, description, offered_by,

start_date?, end_date?, reserve_price?)>

<!ELEMENT itemno (#PCDATA)>

<!ELEMENT description (#PCDATA)>

<!ELEMENT offered_by (#PCDATA)>

<!ELEMENT start_date (#PCDATA)>

<!ELEMENT end_date (#PCDATA)>

<!ELEMENT reserve_price (#PCDATA)> ]>

<!DOCTYPE bids [

<!ELEMENT bids (bid_tuple*)>

<!ELEMENT bid_tuple (userid, itemno, bid, bid_date)>

<!ELEMENT userid (#PCDATA)>

<!ELEMENT itemno (#PCDATA)>

<!ELEMENT bid (#PCDATA)>

<!ELEMENT bid_date (#PCDATA)> ]>

The input data is contained in Table 3-1, Table 3-2, and Table 3-3.

TABLE 3-1 USERS

USERID

Name

Rating

U01

Tom Jones

B

U02

Mary Doe

A

U03

Dee Linquent

D

U04

Roger Smith

C

U05

Jack Sprat

B

U06

Rip Van Winkle

B

TABLE 3-2 ITEMS

ITEMID

Description

Offered_By

Start_Date

End_Date

Reserve_Price

1001

Red Bicycle

U01

1999-01-05

1999-01-20

40

1002

Motorcycle

U02

1999-02-11

1999-03-15

500

1003

Old Bicycle

U02

1999-01-10

1999-02-20

25

1004

Tricycle

U01

1999-02-25

1999-03-08

15

1005

Tennis Racquet

U03

1999-03-19

1999-04-30

20

1006

Helicopter

U03

1999-05-05

1999-05-25

50000

1007

Racing Bicycle

U04

1999-01-20

1999-02-20

200

1008

Broken Bicycle

U01

1999-02-05

1999-03-06

25

TABLE 3-3 BIDS

USERID

ITEMNO

Bid

Bid_Date

U02

1001

35

1999-01-07

U04

1001

40

1999-01-08

U02

1001

45

1999-01-11

U04

1001

50

1999-01-13

U02

1001

55

1999-01-15

U01

1002

400

1999-02-14

U02

1002

600

1999-02-16

U03

1002

800

1999-02-17

U04

1002

1000

1999-02-25

U02

1002

1200

1999-03-02

U04

1003

15

1999-01-22

U05

1003

20

1999-02-03

U01

1004

40

1999-03-05

U03

1007

175

1999-01-25

U05

1007

200

1999-02-08

U04

1007

225

1999-02-12

The XML representation of this tabular data serves as the input to the query. Following is a truncated version of the XML (truncated because the full XML files are lengthy, with nothing new beyond the first element in each):

<items>

<item_tuple>

<itemno>1001</itemno>

<description>Red Bicycle</description>

<offered_by>U01</offered_by>

<start_date>1999-01-05</start_date>

<end_date>1999-01-20</end_date>

<reserve_price>40</reserve_price>

</item_tuple>

<!-- !!! Snip !!! -->

<users>

<user_tuple>

<userid>U01</userid>

BookTom Jones</name>

<rating>B</rating>

</user_tuple>

<!--!!! Snip !!! -->

<bids>

<bid_tuple>

<userid>U02</userid>

<itemno>1001</itemno>

<bid>35</bid>

<bid_date>1999-01-07</bid_date>

</bid_tuple>

<bid_tuple>

<!-- !!! Snip !!! -->

Here’s one of the queries run against this data: List the item numbers and descriptions of all bicycles that currently have an auction in progress, ordered by item number. This query is expressed in XQuery as follows:

<result>

{

for $i in doc("items.xml")//item_tuple

where $i/start_date <= current-date()

and $i/end_date >= current-date()

and contains($i/description, "Bicycle")

order by $i/itemno

return

<item_tuple>

{ $i/itemno }

{ $i/description }

</item_tuple>

}

</result>

Assuming that the current date is 1999-01-31, the expected result of the query is

<result>

<item_tuple>

<itemno>1003</itemno>

<description>Old Bicycle</description>

</item_tuple>

<item_tuple>

<itemno>1007</itemno>

<description>Racing Bicycle</description>

</item_tuple>

</result>

The auctions for the old bicycle and the racing bicycle were active on the last day of January 1999.

FLWOR Expressions

FLWOR means for, let, while, order by, and return. FLWOR expressions are to XQuery what SELECT expressions are to SQL. They’re the constructs you use to ask questions of an XML document. A FLWOR expression must contain either a for clause or a let clause as well as a return clause; optionally, it may also include while and order by clauses. Here’s the syntax of a FLWOR expression:

FLWORExpr ::= (ForClause | LetClause)+ WhereClause? OrderByClause? "return" ExprSingle

ForClause ::= "for" "$" VarName TypeDeclaration? PositionalVar? "in" ExprSingle ("," "$" VarName TypeDeclaration? PositionalVar? "in" ExprSingle)*

PositionalVar ::= "at" "$" VarName

LetClause ::= "let" "$" VarName TypeDeclaration? ":=" ExprSingle ("," "$" VarName TypeDeclaration? ":=" ExprSingle)*

WhereClause ::= "where" ExprSingle

OrderByClause ::= ("order" "by" | "stable" "order" "by") OrderSpecList

OrderSpecList ::= OrderSpec ("," OrderSpec)*

OrderModifier ::= ("ascending" | "descending")? ("empty" "greatest" | "empty" "least")? ("collation" URILiteral)?

In the preceding code, the following conventions are used:

  • "A?" means that A is optional.
  • "A|B" means either A or B but not both.
  • "A+" means one or more occurrences of A.
  • "A*" means zero or more occurrences of A.

To see how to use a FLWOR expression on an example XML document that corresponds to an SQL table, consider this sample document:

<?xml version="1.0" encoding="UTF-8"?>

<customer xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

<row>

<FirstName>Abe</FirstName>

<LastName>Abelson</LastName>

<City>Springfield</City>

<AreaCode>714</AreaCode>

<Telephone>555-1111</Telephone>

</row>

<row>

<FirstName>Bill</FirstName>

<LastName>Bailey</LastName>

<City>Decatur</City>

<AreaCode>714</AreaCode>

<Telephone>555-2222</Telephone>

</row>

<row>

<FirstName>Chuck</FirstName>

<LastName>Wood</LastName>

<City>Philo</City>

<AreaCode>714</AreaCode>

<Telephone>555-3333</Telephone>

</row>

</customer>

Keep this example in mind as you make your way through the fors, lets, whiles, order bys, and returns in the following sections.

The for clause

A FLWOR expression must have at least a for clause (or a let clause) and a return clause. For purposes of illustrating the effect of the for clause, here are a for fragment and the result of that much of a FLWOR expression:

for $c in customer/row

This code selects the row element in the customer XML document. The result is

$c: <row>

<FirstName>Abe</FirstName>

<LastName>Abelson</LastName>

<City>Springfield</City>

<AreaCode>714</AreaCode>

<Telephone>555-1111</Telephone>

</row>

$c: <row>

<FirstName>Bill</FirstName>

<LastName>Bailey</LastName>

<City>Decatur</City>

<AreaCode>714</AreaCode>

<Telephone>555-2222</Telephone>

</row>

$c: <row>

<FirstName>Chuck</FirstName>

<LastName>Wood</LastName>

<City>Philo</City>

<AreaCode>714</AreaCode>

<Telephone>555-3333</Telephone>

</row>

The result includes three instances of the variable $c, each one being bound to a separate row element in the binding sequence from the original document. This result is called a tuple stream. The for clause iterates over the tuples in the binding sequence, binding the variable to each of the tuples in the sequence in turn.

The let clause

To show the effect of the let clause, I need a second document example to go along with the customer example. Here’s that document:

<?xml version="1.0" encoding="UTF-8"?>

<product xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

<row rating="0">

<ProdNo>101</ProdNo>

BookFirecracker 1</Name>

<Size>Big</Size>

</row>

<row rating="1">

<ProdNo>102</ProdNo>

BookFirecracker 2</Name>

<Size>Huge</Size>

</row>

<row rating="3">

<ProdNo>103</ProdNo>

BookFirecracker 3</Name>

<Size>Tremendous</Size>

</row>

</product>

Whereas the for clause iterates over the items in the binding sequence, the let clause binds its variables with the entire sequence. A let clause that binds multiple variables generates a single tuple containing all the variable bindings. Here’s an example:

let $c := /customer/row, $p := /product/row/ProdNo

This code produces the following result:

$c: <row> $p: 101

<FirstName>Abe</FirstName> 102

<LastName>Abelson</LastName> 103

<City>Springfield</City>

<AreaCode>714</AreaCode>

<Telephone>555-1111</Telephone>

</row>

<row>

<FirstName>Bill</FirstName>

<LastName>Bailey</LastName>

<City>Decatur</City>

<AreaCode>714</AreaCode>

<Telephone>555-2222</Telephone>

</row>

<row>

<FirstName>Chuck</FirstName>

<LastName>Wood</LastName>

<City>Philo</City>

<AreaCode>714</AreaCode>

<Telephone>555-3333</Telephone>

</row>

The where clause

The result of a for clause or a let clause contains some data that you want, but it likely also contains some irrelevant data that you don’t want. You can filter out the data that you don’t want with a where clause. Consider the following XQuery fragment:

for $p at $i in product/row

where $i = $p/@rating

This code produces the following result:

$p: <row rating="3"> $i 3

<ProdNo>103</ProdNo>

BookFirecracker 3</Name>

<Size>Tremendous</Size>

</row>

Only in the case of the last product does the element counter equal the value of the rating attribute.

The order by clause

The order by clause, as you would expect, specifies an order for the items in the result, according to the ordering specification (OrderSpec). The ordering specification contains a single expression (ExprSingle) and an optional ordering modifier (OrderModifier). You can add an order by clause to one of the query fragments in the preceding sections to see how it affects the result, as follows:

for $c in customer/row

order by $c/LastName descending

This code selects the row element in the customer XML document, listing the tuples in descending order by last name. The result is

$c: <row>

<FirstName>Chuck</FirstName>

<LastName>Wood</LastName>

<City>Philo</City>

<AreaCode>714</AreaCode>

<Telephone>555-3333</Telephone>

</row>

$c: <row>

<FirstName>Bill</FirstName>

<LastName>Bailey</LastName>

<City>Decatur</City>

<AreaCode>714</AreaCode>

<Telephone>555-2222</Telephone>

</row>

$c: <row>

<FirstName>Abe</FirstName>

<LastName>Abelson</LastName>

<City>Springfield</City>

<AreaCode>714</AreaCode>

<Telephone>555-1111</Telephone>

</row>

Because you’re ordering by last name in descending order, the record for Chuck Wood precedes that of Bill Bailey, which comes before the record for Abe Abelson.

The return clause

The return clause specifies what’s returned by the FLWOR expression after all the other clauses have had a chance to have an effect. The ExprSingle in the return clause is evaluated once for each tuple produced by the combined activity of the other clauses. The result is a sequence of values. Adding a return clause to the query fragment in the preceding section to make a complete query results in the following:

<result>

{

for $c in doc("customer.xml")

order by $c/row/LastName descending

return

<row>

{ $c/FirstName }

{ $c/LastName }

</row>

}

</result>

This code selects the row element in the customer XML document, listing the tuples in descending order by last name. Then it outputs the first and last names in the tuples. The result is

<result>

<row>

<FirstName>Chuck</FirstName>

<LastName>Wood</LastName>

</row>

<row>

<FirstName>Bill</FirstName>

<LastName>Bailey</LastName>

</row>

<row>

<FirstName>Abe</FirstName>

<LastName>Abelson</LastName>

</row>

</result>

XQuery versus SQL

The data in an XML document is in quite a different form from the data in an SQL table, but some similarities exist between the two worlds. XQuery’s FLWOR expression corresponds to some extent with the SQL SELECT expression. Both expressions are used to retrieve the data you want from a collection of data, most of which you don’t want at the moment.

Comparing XQuery’s FLWOR expression with SQL’s SELECT expression

Although XQuery’s let clause has no analog in SQL, the XQuery for clause is related to the SQL FROM clause in that both specify the source of the data. XQuery’s order by clause serves the same purpose that SQL’s ORDER BY clause serves. In both XQuery and SQL, the where clause filters out data that you don’t want to include in the result. SQL’s GROUP BY and HAVING clauses have no analogs in XQuery.

Relating XQuery data types to SQL data types

Some of XQuery’s data types correspond to SQL data types; others don’t. XQuery has some data types that don’t correspond to any SQL data types, and vice versa.

Table 3-4 lists the XQuery data types and, where applicable, the corresponding SQL types. Where no corresponding type exists, a dash serves as a placeholder.

TABLE 3-4 XQuery 1.0 Data Types and Corresponding SQL Data Types

XQuery 1.0 Data Types

SQL Data Types

xs:string

CHARACTER, CHARACTER VARYING, CHARACTER LARGE OBJECT, NATIONAL CHARACTER, NATIONAL CHARACTER VARYING, NATIONAL CHARACTER LARGE OBJECT

xs:normalizedString

xs:token

xs:language

xs:NMTOKEN

xs:NMTOKENS

xs:Name

xs:NCNAME

xs:ID

xs:IDREF

xs:IDREFS

xs:ENTITY

xs:ENTITIES

xs:BOOLEAN

BOOLEAN

xs:decimal

NUMERIC, DECIMAL

xs:integer

INTEGER

xs:nonPositiveInteger

xs:negativeInteger

xs:long

BIGINT

xs:int

INTEGER

xs:short

SMALLINT

xs:byte

xs:nonNegativeInteger

xs:unsignedLong

xs:unsignedInt

xs:unsignedShort

xs:unsignedByte

xs:positiveInteger

xs:float

FLOAT, REAL

xs:double

FLOAT, DOUBLE

xs:error

xs:duration

xs:dateTime

TIMESTAMP WITH TIME ZONE, TIMESTAMP WITHOUT TIME ZONE

xs:date

DATE WITH TIME ZONE, DATE WITHOUT TIME ZONE

xs:time

TIME WITH TIME ZONE, TIME WITHOUT TIME ZONE

xs:gYearMonth

xs:gYear

xs:gMonthDay

xs:gDay

xs:gMonth

xs:hexBinary

BINARY LARGE OBJECT

xs:base64Binary

BINARY LARGE OBJECT

xs:anyURI

xs:QName

xs:NOTATION

xdt:dayTimeDuration

INTERVAL (day–time interval)

xdt:yearMonthDuration

INTERVAL (year–month interval)

xs:anyType

XML

xs:anySimpleType

xdt:untyped

Node types

Structured user-defined types (UDTs)

User-defined complex types

Structured UDTs

ROW

REF

List types and sequences

ARRAY

List types and sequences

MULTISET

DATALINK

Clearly, a lot more XQuery types are available than SQL types. In most cases, casting an SQL type to an XQuery type isn’t a problem, but going the other way may be a challenge.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.15.27.232