Preface

Why the subject matter is important

In a remarkably short period, XML has arguably become the most important language for marking up documents for the World Wide Web and for industry in general. Equally important, XML is rapidly becoming the lingua franca for marking up traditional business data, for exchanging information between business partners and between application programs, and for expressing a host of concepts that improve the usability of computer systems.

While it may be tempting to view XML as a “silver bullet” – a solution to all of our problems – the truth is a bit more prosaic: XML is merely a tool (admittedly a very important one) that can help solve a significant range of problems. Like most tools, XML introduces tradeoffs and complications. Among the difficulties that XML users will increasingly encounter are the ones posed by locating and retrieving information stored in documents marked up using XML.

As you’ll learn in this book, there are many approaches to querying XML documents and repositories of such documents. We cannot claim to have addressed every possible approach, or even every approach in use at the time we wrote this book. There are simply too many possibilities and alternatives, too many researchers and practitioners inventing new technologies. Instead, we have focused on the approaches that have the broadest uses, the largest community of adherents, and the greatest promise for economic success.

Before going further, we think that a quick explanation is in order for one key term that crops up repeatedly in this book: document. Because of XML’s origins, sequences of characters that follow the rules of XML, and are able to stand alone, are properly known as “XML documents”, even when they have nothing to do with books, articles, or any kind of textual material. When numeric data or even graphic images are represented in a standalone XML form, that XML is properly called an XML document. XML that cannot stand by itself is sometimes called an XML fragment. In general, throughout this book, we use the word “document” or “fragment” when a specific sort of XML is being referenced and we need to be clear about the nature of that XML. Otherwise, we mostly use the raw term “XML” and depend on the context to disambiguate our usage.
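As a minimal sketch of the distinction between a document and a fragment, the following Python snippet uses the standard library's ElementTree parser to check whether a piece of XML can stand alone. The element names and the helper function is_standalone_document are hypothetical and chosen only for illustration. The check tests well-formedness as a document, which is only an approximation of the terminology above.

```python
import xml.etree.ElementTree as ET


def is_standalone_document(text: str) -> bool:
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Numeric data wrapped in a single root element: an XML document,
# even though it has nothing to do with books or articles.
document = "<temperatures><reading unit='C'>21.5</reading></temperatures>"

# Two sibling elements with no common root: this XML cannot stand
# by itself, so a parser rejects it as a document.
fragment = "<reading unit='C'>21.5</reading><reading unit='C'>22.0</reading>"

print(is_standalone_document(document))
print(is_standalone_document(fragment))
```

Wrapping the two sibling elements in a single enclosing element would turn the second string into something that parses as a document.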

Why we wrote this book

“XML” is an enormous topic for any individual to understand. The term has come to imply much more than the markup language of the same name. Due in large part to the versatility of the markup language and the enormous utility of the Internet and the World Wide Web, there are countless computer scientists and software engineers developing specifications, tools, application programs, and even hardware that use or depend on some use of XML.

There are many fine books available that can teach you how to mark up your documents and your data with XML, how to use the Extensible Stylesheet Language (XSL) to transform documents into other documents, how to use the many tools such as XML parsers and XSL transformation engines, and so forth. There are even several books available that focus exclusively on XQuery, the almost-finalized W3C XML Query language.

But we have not seen any books that cover a broader subject that we think is vital: how to locate information in documents that are marked up using XML and how to find and extract that information in repositories of such documents. It is certainly important to mark up your documents and your data to capture the meaning inherent in them, but tremendous additional value is available when you can use powerful query facilities that not only find certain documents in a repository, but also find and extract the fine-grained information contained in those documents.

In this book, we identify and explore several approaches to querying XML documents, concentrating on those that we believe are most likely to be important in the near-to-medium future. We also give you a perspective on some of the other technologies that are closely related to the subject of querying XML. In doing so, we not only give you valuable insights about locating and retrieving information in XML documents, but also put the subject into the contexts in which it will be used.

Who should read this book

We wrote this book primarily to benefit software engineers who have to design and build applications that use XML and that access documents and data presented in an XML form. While the subject is necessarily technical in nature and presentation, we decline to focus exclusively on production of lines of code. Instead, we approach mastery of the subject by ensuring that readers understand the reason a particular topic is important, that they know the context in which the topic is relevant, that the principles of the topic are made clear, and that the details of writing code appropriate to the topic are illustrated and exemplified.

The book should be of interest to more than just software developers, though. Architects of software systems that use XML must know how search and retrieval issues are to be handled, while managers and team leaders need an understanding of the relationships between XML markup and storage and future retrieval of documents based on the semantics of the information they contain.

How the book is organized

This book is divided into several parts. Part I, “XML: Documents and Data”, starts off with a survey of structured document technology and examines several languages used to produce and/or represent such documents. It continues with an exploration of the problems associated with querying data generally, as well as with searching XML documents, and includes a comparison of querying XML with using SQL to query traditional data.

Part II, “Metadata and XML”, introduces the subject of metadata for XML – information that describes XML documents and markup languages. This part covers Document Type Definitions (DTDs) and XML Schemas (with some attention given to competing XML schema definition languages). We discuss the “meaning” of XML markup and survey its use in a number of different XML-related markup languages. This part finishes with a presentation of XML’s Information Set (commonly known as the Infoset) and an introduction to several other data models used to describe XML documents in a formal manner.

Part III, “Managing XML for Querying”, looks at the different sorts of databases (e.g., relational, object-relational, object-oriented, and so-called “native XML”) in which XML documents are being stored. It also examines several other W3C specifications that play a role in XML documents that might be queried. This part of the book includes some information about a number of current products that are used to store, manage, query, and retrieve XML documents.

Part IV, “Querying XML”, is the technical heart of the book, describing four ways to query XML. XPath (the XML Path Language) is already an established language for querying within an XML document, so this part begins with a significant discussion of XPath and its usage for XML querying. XQuery is a brand new language designed specifically for querying XML, so we devote a lot of time and detail to it, including an analysis of the type system and data model used by that language, an examination of the formal semantics of the language, and a discussion (replete with examples) of the use of XQuery and its companion XQueryX. SQL is the leading query language for structured data today. We explore the ways that SQL can be used to query XML, especially if the XML is “shredded” and stored in an object-relational form. Finally, in this part we discuss SQL/XML, a set of extensions to SQL that leverage XPath and XQuery to overcome some of SQL’s limitations in managing semi-structured data.

Part V, “Querying and the World Wide Web”, provides a look at a number of specific XML-based markup languages and responds to the question of whether XPath, XQuery, SQL, and/or SQL/XML are suitable for querying documents that are marked up using such languages or whether other, more specific, query facilities are needed to deal with them. It also looks at the ways in which XML is, and is going to be, used on the Internet, both for casual uses like browsing and for industrial uses such as data interchange between business partners. The impacts of internationalization on XML and related specifications are addressed here as well.

We finish up the book with appendices that give you a glimpse into the way in which open standards like XML, XQuery, and SQL/XML are developed, that contain the complete grammar of XQuery, that list and describe all of the SQL/XML functions, and that provide a lengthy set of examples and a small sample of data against which they have been tested.

The example we’re using

We are both avid fans of the cinema – which is illustrated by the fact that, between us, we subscribe to just about every possible movie channel offered by satellite television providers. Continuing the tradition started in earlier books written by Jim, we’ve chosen to use the subject of movies as the basis for our example. We’ve collected data on a broad range of films and organized it into a sort of “database” that is, in fact, a modestly large XML document. This document – data with XML markup – serves as the foundation for many of our examples. (Note that we do not pretend that our example document is marked up in any sort of optimal way, suitable for industrial use; we chose specific markup styles to illustrate the points we make at various parts of the book.) When the topic demands something a little less data-oriented, we use a smallish textual document that discusses several film-related topics.

Syntax conventions

In several places in this book, we define the syntax of various language components relevant to XML, XML query languages, and so forth. While we are not particularly fond of the syntax conventions that the W3C has adopted (we find them somewhat less readable than several other conventions), we believe that readers of this book will be best served by consistency of style accompanied by explanations.

Therefore, we have (with slight reluctance) adopted the same style used in the W3C specifications that we reference in the book. You may be familiar with those conventions, but we think that a quick summary will help some readers.

A variation of Backus-Naur Form (BNF) is used for syntax presentation. More specifically, a syntactic symbol (called a nonterminal symbol to distinguish it from language components that represent only themselves) is defined using a notation in which the symbol being defined appears to the left of a special operator (::=) and the definition of that symbol appears as an expression written following that operator. For example:

nonterminal-x ::= nonterminal-y ( "," nonterminal-y )*

That line, called a BNF production, defines a nonterminal symbol (nonterminal-x) by saying that it is made up of a second nonterminal symbol (nonterminal-y), optionally followed by zero or more (that’s the meaning of the asterisk, *) repetitions of a sequence made up of a literal comma (that’s a terminal symbol) and another instance of that second nonterminal symbol (nonterminal-y).

Therefore, if nonterminal-y happens to be defined to be an identifier (in XML, these are either QNames or NCNames), then an instance of nonterminal-x might be:

image
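To make the production described above concrete, here is a minimal sketch in Python that expresses it as a regular expression and tests a few candidate strings. The pattern used for nonterminal-y is a hypothetical, ASCII-only simplification of an identifier (real QNames and NCNames permit many more characters), and the optional whitespace around the comma is our assumption rather than part of the production. The sample strings are illustrative only.

```python
import re

# Hypothetical, simplified stand-in for nonterminal-y: an ASCII-only
# identifier. Real NCNames and QNames allow far more characters.
NONTERMINAL_Y = r"[A-Za-z_][A-Za-z0-9_.-]*"

# nonterminal-x: one nonterminal-y, followed by zero or more repetitions
# of a literal comma and another nonterminal-y. Whitespace around the
# comma is permitted here as an assumption of this sketch.
NONTERMINAL_X = re.compile(rf"{NONTERMINAL_Y}(?:\s*,\s*{NONTERMINAL_Y})*")


def matches_nonterminal_x(text: str) -> bool:
    """Return True if the whole string is an instance of nonterminal-x."""
    return NONTERMINAL_X.fullmatch(text) is not None


print(matches_nonterminal_x("title"))                  # one identifier
print(matches_nonterminal_x("title, director, year"))  # comma-separated list
print(matches_nonterminal_x("title,"))                 # trailing comma
print(matches_nonterminal_x(""))                       # empty string
```

The first two strings are accepted. The last two are rejected, because every comma must be followed by another identifier and at least one identifier is required.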

One important thing to note is that, in this style of BNF, all terminal symbols are enclosed in quotation marks, which might be single quotation marks (‘…’) or double quotation marks (“…”). Anything, including parentheses, not enclosed in quotation marks is either a nonterminal symbol or a character used in the BNF to specify its meaning.

Here is a complete list of the conventions of this style of BNF, as used in this book:

• “string” – the literal string given inside the double quotes

• ‘string’ – the literal string given inside the single quotes

• a b – a single occurrence of a followed by a single occurrence of b

• a | b – a single occurrence of a or a single occurrence of b, but not both

• a? – a single occurrence of a or nothing at all; optional a

• a+ – one or more occurrences of a

• a* – zero or more occurrences of a

• (expression) – expression is treated as a unit; allows subgroups to carry the operators ?, *, or +

• /* … */ – a comment in the BNF (this is unrelated to comments in languages being defined by the BNF, such as XQuery)
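Readers who know regular expressions may find it helpful to see the conventions listed above alongside their closest regular-expression equivalents. The following Python sketch is only an analogy under stated assumptions: the terminal symbols "a" and "b" are hypothetical, the table BNF_TO_REGEX and the function accepts are names we invented for illustration, and whitespace between symbols is ignored.

```python
import re

# Each BNF form, paired with a regular expression over the hypothetical
# terminal symbols "a" and "b" that accepts the same strings.
BNF_TO_REGEX = {
    "a b": r"ab",         # a followed by b
    "a | b": r"a|b",      # a or b, but not both
    "a?": r"a?",          # optional a
    "a+": r"a+",          # one or more occurrences of a
    "a*": r"a*",          # zero or more occurrences of a
    "(a b)+": r"(?:ab)+", # grouping lets + apply to the whole unit
}


def accepts(bnf_form: str, text: str) -> bool:
    """Return True if text, taken as a whole, matches the given BNF form."""
    return re.fullmatch(BNF_TO_REGEX[bnf_form], text) is not None


print(accepts("a | b", "a"))      # a alone satisfies the choice
print(accepts("a | b", "ab"))     # both together do not
print(accepts("a*", ""))          # zero occurrences are allowed
print(accepts("a+", ""))          # at least one occurrence is required
print(accepts("(a b)+", "abab"))  # the grouped unit repeats twice
```

The analogy holds only for simple cases like these. Unlike a regular expression, a BNF production can refer to other nonterminal symbols, including itself.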

Additional resources

The data and queries in appendix A, plus additional examples and explanations, are available for download from the web site for this book’s examples, http://xqzone.marklogic.com/queryingxmlbook/. You may also visit http://www.mkp.com/QueryingXML for more information.

Type conventions

A quick note on the typographical conventions we use in this book seems in order:

• Type in this font is used for all ordinary text.

• Type in this font is used for terms that we define or for emphasis.

• Type in this font is used for all the examples, syntax presentations, keywords, identifiers, and XML text that appear in ordinary text.

Acknowledgements

Writing a book is an immense task and it consumes enormous quantities of resources such as energy, time for research and for writing, and often patience. A book like this one is quite difficult to produce, but difficult tasks often produce commensurately great rewards (financial rewards very rarely among them!). It’s exceedingly rare to do it alone – the help, guidance, and support of others are always appreciated: for ideas, for trying out concepts and wording, for reviewing paragraphs and whole chapters, and just for offering encouragement.

We want to give credit to all of the wonderful, talented people who have helped us create this book, especially the following people (alphabetized by their last names) who gave us extensive reviews, which heavily influenced the content and accuracy of this book.

• James Bean, author of “XML for Data Architects: Designing for Reuse and Integration” and “Engineering Global E-Commerce Sites”, both published by Morgan Kaufmann, and CEO of Relational Logistics Group.

• Alexander Falk, President and CEO of Altova, GmbH in Austria, and Altova, Inc. in the USA, who also generously provided us with licenses for Altova’s flagship Enterprise XML Suite.

• Muralidhar Krishnaprasad, our friend and colleague at Oracle, who seems to be an expert at all things related to XQuery, especially its implementation.

• Zhen Hua Liu, also our friend and colleague at Oracle, who is a driving force behind the implementation of SQL/XML and a constant source of valuable information and observations.

Of course, all remaining errors (and we harbor no illusions that we found and eliminated all errors in a subject as complex as this one) are solely our responsibility.

We also offer our deepest gratitude to the wonderful people at Morgan Kaufmann Publishers for their invaluable help and participation in the production of the book. Diane Cerra, our talented and patient editor, who trusted Jim enough to publish his first book, got us started on this book and came back to help us finish it. Two other editors, Lothlórien Homet and Rick Adams, worked with us for several months during the time when we were writing the most difficult chapters.

At various times during the lengthy writing process, Asma Stephan, Corina Derman, Mona Buehler, and Belinda Breyer made themselves available to answer our questions about schedules and production, to track down information that we managed to misplace, to make sure that our chapters were quickly reviewed by the right people, and to give us frequent and friendly reminders of approaching deadlines. Our production manager, Simon Crump, worked closely and patiently with us during the production process, making sure that our drafts were thoroughly copyedited and properly typeset, that our reviews of the galleys were applied to the typeset draft, and that all production errors were promptly handled. Brent dela Cruz, our marketing manager, bears the burden of ensuring that this book is made available to you, our readers. To Diane, Asma, Simon, Brent, and all of the other fantastic people at Morgan Kaufmann, thanks!

Credit must also be given to the incredible group of people who make up the various W3C Working Groups responsible for the specifications discussed in this book. The languages and facilities related to querying XML documents include XML Query (co-chaired by Jim’s long-time friend and colleague Andrew Eisenberg), XSL (chaired by the delightful Sharon Adler), and XML Schema (first chaired by one of the most generous and smartest people around, Michael Sperberg-McQueen, and now chaired by our good friend David Ezell, who is proving to be remarkably good at herding cats), among others.

We are particularly grateful to our friends who offered suggestions that certainly improved the content and focus of the book. They include Ashok Malhotra, Andrew Eisenberg, Murali Krishnaprasad, and Zhen Hua Liu.

Finally, we want to express our appreciation to Don Chamberlin for writing the Foreword to this book. Don wrote the Foreword for Jim’s first SQL book and it feels like we’ve reached a sort of closure, coming full circle on SQL and starting a new circle for the next major query language.

Jim: I give special thanks to my wonderful partner, best friend, and spouse, Barbara Edelberg. She took up all the slack when I was stuck at the computer ‘til all hours of the night, writing. Barbara had to deal with me on the road and unavailable so much of the time. It was Barbara’s emotional support and encouragement, as I agonized over every sentence in the book, that got me through it. I also owe a debt of gratitude to my co-author, friend, and backpacking buddy, Stephen Buxton, for stepping in to write the book with me – he joined me just as I was falling into despair at the magnitude of the task and the difficulty of writing this book while doing my “day job”.

Stephen: I’d like to say thank you to my family for their support and encouragement – my kids Maria and Samuel, and my other “kids” Jennie and Sarah, and most of all, my lovely wife Veronica (“I thought you said it was finished!”), who has stuck with me through many, many late nights and weekends. I’d also like to thank my co-author, erstwhile colleague, and very good friend Jim Melton for guiding me through my first authoring experience. Thanks Jim!
