Chapter 2

Querying

2.1 Introduction

In Chapter 1 we discussed the second term of the title of this book – “XML.” In this chapter we give some background on “querying,” before introducing “querying XML” in Chapter 3. We describe the query problem and some ways that problem is addressed today. In this chapter we focus on the issues that are common to all query scenarios, and we focus on SQL as a query solution. In the next chapter you will see how those issues can be addressed when the data is XML, and we will describe some wrinkles that are unique to querying XML.

2.1.1 Definitions of Query

Let’s start with three definitions of the word query. In everyday English, to query means “to ask questions of, especially with a desire for authoritative information.”1 We like this definition because it points up the precise nature of (most) queries – when you query a database, you don’t expect to get back an educated guess at, say, the total sales of each movie in the last calendar year. On the contrary, you expect a precise, authoritative answer.

Our second definition talks about querying databases and introduces the notion of a query language. “[databases] provide a means of retrieving records or parts of records and performing various calculations before displaying the results. The interface by which such manipulations are specified is called the query language.”2 This definition also brings out the need for a query to not only return a record or set of records but also to bring back parts of records and to manipulate (compare, aggregate, transform, etc.) that data.

Finally, here’s a more pedantic definition:

In general, a query (noun) is a question, often required to be expressed in a formal way. The word derives from the Latin quaere (the imperative form of quaerere, meaning to ask or seek). In computers, what a user of a search engine or database enters is sometimes called the query. To query (verb) means to submit a query (noun). A database query can be either a select query or an action query. A select query is simply a data retrieval query. An action query can ask for additional operations on the data, such as insertion, updating, or deletion.3

This definition emphasizes the importance of update as a part of query (see Chapter 13, “What’s Missing?”). It also talks about query in terms of both search engines and databases. We discuss the broader notions of search (as opposed to query) in Chapter 18, “Finding Stuff.” In the rest of this book, however, we consider query to be the more formal kind of query that you might pose to a database or other query application – finding things that you know exist and/or retrieving information that you need in order to do some task.

2.2 Querying Traditional Data

In this section we discuss querying simple data types, such as integers, dates, and short strings, that can be easily represented in simple structures such as rows and columns in a table. In the rest of this book, we will refer to this as traditional data. Section 2.3 discusses querying nontraditional data.

For the past two decades the most popular way of querying data has been SQL, the SQL Query Language. Most of the world’s critical data is stored in a relational database, and most users and applications employ SQL to find, retrieve, and manipulate that data. So SQL defines the benchmark (or gold standard) for querying data – any new approach to querying data must either do at least all the things that SQL does or provide a good reason for not doing those things. That’s why we focus on SQL in this section and the next.

A relational database (a SQL database) stores data in tables and allows search across those tables with SQL. SQL is particularly good at querying traditional data (though you will read in Section 2.3 that SQL has been extended to query nontraditional data, too).

2.2.1 The Relational Model and SQL

We have already seen (Figure 1-1) that the movies data can be stored in a set of tables. The Relational Model,4 first proposed by Dr. Ted Codd5 and developed in collaboration with Chris Date and others, includes the notion of tables, columns, and rows (or relations, attributes, and tuples). A table can be viewed as a grid of rows and columns, where each row-column intersection (or cell) contains a single data item. Each column has a data type (integer, character, date, etc.). The Relational Model also defines a relational algebra, with operations on tuples (rows in a table, or intermediate query results).6 The most important operations are projection, selection, union, and join.

A projection produces only some of the columns of the table, by naming those columns in the SELECT clause. See Example 2-17 (where the result of the query is the contents of the shaded column).

A selection produces only some of the rows of the table, by filtering the results using some predicate. See Example 2-2 (where the result of the query is the contents of the shaded rows).

Of course, these operations can be composed in an ad hoc way – see Example 2-3 for an example of selection and projection together (where the result of the query is the contents of the shaded cells).

Union and join combine data from two or more tables. Union combines data vertically (appends rows from one table onto rows of another), while join combines data horizontally (appends columns from one table onto columns of another, usually on the basis of the value of one of the columns).

Example 2-1   Projection

image

Result:

image

Example 2-2   Selection

image

Result:

image

Example 2-3   Projection and Selection

Result:

image

image

Example 2-4   Union

image

Result:

ID familyName givenName
78 Landis John
79 Christopher Guest
80 Scott Ridley
44 Folsey George, Jr.
45 Guber Peter
46 Murphy Karen
47 Simmons Matty
48 Reitman Ivan
49 Deeley Michael

Example 2-5   Join

image

Result:

image

Relational databases and the SQL Query Language have been enormously successful. SQL has set the bar for query languages, so that any general-purpose query language for XML will be expected do at least what SQL does for relational data. That said, the Relational Model does have its limitations. Many of these limitations are addressed by object-relational storage and the object extensions to SQL, described in the next section.

2.2.2 Extensions to SQL

One of the major criticisms of the Relational Model is that its data model imposes a rather simplistic structure on the data. The Relational Model represents all data as tables (rows and columns) of cells, where each cell contains a single data item. While much of the world’s data fits quite neatly into this model, a lot of data simply does not fit.

Rows and Columns with Single-Value Cells – Too Simplistic

In Figure 1-1 we could not represent all the data about one movie in one row of the MOVIES table. The director, the producer, and arguably the title do not fit into a single relational cell. The director does not fit because the director data consists of two (and possibly many) data items, firstName and givenName. The producer data also consists of more than one data item, plus there is a many-to-many relationship between producers and movies – not only can there be many movies attributed to the same producer, but there may be many producers associated with each movie. The status of the title data is more controversial – it seems to be a single data item of type string, but if we want to do anything really useful with it we need to consider it as (at least) a sequence of words. We will pursue this further in Chapter 13, “What’s Missing?.”

Object-Relational Storage

In the 1990s the advent of object-oriented database management systems (OODBMSs) caused a huge stir, with many predicting the end of the road for relational database management systems (RDBMSs). Some said that the Relational Model was so limited that relational databases would disappear entirely in favor of object-oriented databases. What has happened instead is that all the major relational database vendors have implemented object extensions, so they are now object-relational8 database management systems (ORDBMSs).9

In an object-relational database, we can represent complex structure as an object type, or class. An object type may include simple types, and it also includes methods – operations that can be performed on that type. Now, instead of a single data item appearing in a cell, we can have an instance of an object type.

In our movie example, we might add a director object type, with methods that return the director’s name (familyName + given-Name) as a single string in some natural format. Stonebraker’s book on ORDBMSs (mentioned earlier) gives more compelling examples of much more complex object types, to represent data defining two-dimensional spatial objects, image files in many formats, and time series data (which is used by stock traders to analyze trends in stock prices). Since Stonebraker’s book was published, life sciences has emerged as an important field, where complex objects representing such things as DNA are needed. There are, of course, a lot of usage scenarios that fall in between these extremes of complexity – many applications today model real-world things like “customers” and “purchase orders” as objects. And in the XML world, the DOM (Document Object Model) is an object representation of an XML document (see Chapter 6, “The XML Information Set (Infoset) and Beyond,” for a brief description of the DOM).

Object Extensions to SQL

With SQL:199910 (formerly referred to as SQL3), the SQL standard embraced object-oriented technology with a set of extensions to SQL-92. Extensions include support for:

• user-defined types

• type constructors for row types, reference types, and collection types

• user-defined functions and procedures

• LOBs (large objects)

With these extensions, a SQL:1999 database user can define, manipulate, and query objects in SQL. Database vendors have also implemented their own object extensions, in addition to those defined in the SQL:1999 standard.

2.2.3 Querying Traditional Data – Summary

In this section we introduced the notion of traditional data, defined informally as “numbers, dates, and short strings.” We introduced the Relational Model and SQL, which are still the gold standard for storing, representing, manipulating, and querying this kind of data. Then we introduced object-relational technology, which allows traditional data with a rich structure, and briefly summarized the SQL object extensions in SQL:1999.

A lot of the data that is stored and represented as XML today is traditional data. Interestingly, some of the major database vendors’ first forays into XML and XQuery support are based on an object-oriented approach, presumably leveraging their existing object infrastructure and capabilities.

2.3 Querying Nontraditional Data

At least 90% of the data in the world is nontraditional data, and a great deal of valuable information is locked up in Word files,11 PowerPoint presentations, PDF documents, diagrams, and so on. In the last 10 years or so, as the problems of storing and querying traditional data have been largely solved by SQL and its object extensions, the database industry has turned to solving the problem of querying nontraditional data. We define nontraditional data informally as data that cannot be represented naturally as numbers and dates and strings, such as documents and pictures and movie clips. We hesitate to call it “unstructured,” since everything has some structure. And it is not always binary (though it often has some binary component).

In this section, we describe three approaches to querying nontraditional data – metadata, objects, and markup. Let’s assume that the movie American Werewolf in London has:

• a preview clip, AmericanWerewolflnLondon-preview.mpg

• a radio ad, AmericanWerewolfInLondon-RadioAd.mp3

• a poster, AmericanWerewolfInLondon-Poster.jpg

• a review, AmericanWerewolfInLondon-review.pdf

Let’s also assume that we want to be able to store and query all that nontraditional data, along with the movies data we already have.

2.3.1 Metadata

One approach is to store the nontraditional data as an opaque chunk of data and add metadata. In a database this opaque chunk is often called a BLOB, or binary large object. Despite its name, a LOB has none of the useful attributes of an object. LOB storage just means that the data item is stored as a single item in one place. A binary LOB is a LOB that contains data that is not character-based, as opposed to a character LOB, or CLOB, which contains data that is in the character set of the database. A LOB is handled in a special way by the database to cope with its potentially large size – typically two or four gigabytes. But the data is opaque; i.e., nothing is known about a LOB instance except that it may be large.

Once we have the data in a LOB, we can store it in a database table and add metadata in other columns in the table. Then we can query the metadata to find a particular instance of a LOB or to find out information about an instance of a LOB.

There are several ways to create the metadata:

• Some formats have metadata embedded in them. Text formats (PDF, Microsoft Word) generally contain some automatically generated metadata – the author’s name, document title, last modified date, etc. – and some metadata that can be added by the author. This metadata can be extracted programmatically and written into database columns as the data is inserted. For example, Oracle’s inter Media product will extract metadata from most common document, audio, video, and image formats and make that metadata available for query in columns of a table. But the LOB is still opaque – some processing needs to be done to “decorate” it with metadata, even if that metadata exists in the LOB.

• Whoever publishes the data – inserts the data into a database – can add metadata via an application. A CMS (content management system) will allow the publisher to add all kinds of metadata at various stages of the publishing process.12

• Some interesting program13 can produce meaningful metadata for text documents automatically, even when that metadata does not exist explicitly in the document. They work by recognizing names of people and companies and so on and possibly checking those names against an internal dictionary or a web service. At the time of writing, this technology – automatic entity extraction – is still in its infancy.

    Once the metadata columns have been populated, you can query the metadata to find, or find out about, the LOB data – e.g., think of the movie example as metadata for the actual movie.

2.3.2 Objects

Object technology offers the potential of storing nontraditional data items in a nonopaque way – to “open the box” and treat, say, a PDF document as a PDF document rather than as an opaque LOB. All we need to do is define an object type to represent the PDF-formatted data, and some methods that make sense for PDF. Then we can query the actual document instead of querying its metadata.

In Section 2.3.1 we said that Oracle’s interMedia can extract metadata that is embedded in, e.g., a picture. interMedia does that by creating an object type for the picture format and querying that object to extract useful metadata.

2.3.3 Markup

We have discussed decorating nontraditional data with metadata and querying nontraditional data directly using object technology. Both approaches require some special, nonstandard effort. The metadata approach requires manual or programmatic effort to produce the metadata, then some design to figure out how to store the metadata, and finally some programming to create an application that will query the metadata in an application-specific way. The objects approach requires the definition of an object type, with methods, for each kind of data to be queried, plus an application to query that data.

You can achieve similar results with markup, but you end up with an XML document that can be described and queried with standard tools. That means you can leverage existing tools and skills and communicate your efforts easily to other companies, institutions, or software programs.

You can, of course, use markup to represent metadata that has been created or extracted as described in Section 2.3.1. Adobe has taken this approach with their XML Data Packet Specification, part of the Adobe XML Architecture.14 The specification defines a format for wrapping a PDF document in XML tags to make a data packet that can be consumed by anything that understands XML. Any XML that is embedded in the PDF document can be extracted and packaged as a separate packet, while the bulk of the PDF document is encoded in base 64.

A bolder approach is to define an XML markup language that will allow you to represent the nontraditional data “natively” as XML. In the document world, the de facto standard way of representing a technical paper or article as XML is DocBook (though of course there are many XML languages that are regarded as standard in some domain). Microsoft has made a lot of progress in this area – with Microsoft Office 2003, you can save any Office document as XML. Microsoft have introduced several markup languages, including WordML and ExcelML, to describe the XML structure of Office documents that are represented as XML.15

Even more adventurous are the attempts to define markup languages for other media types. Probably the most advanced is scalable vector graphics (SVG),16 which became a W3C recommendation in January of 2003. SVG defines a way of representing rich graphical content (consisting of vector graphic shapes, images, and text) in XML.

2.3.4 Querying Content

So far in this section on querying nontraditional data we have described ways to extract (or at least surface) traditional, structured data that tells us something about the nontraditional content so that we can query that traditional data. There is another approach – to query the nontraditional data directly, in a way that is appropriate to that kind of data.

The obvious example is to query text data using full-text queries. We discuss full-text querying in Chapter 13, “What’s Missing?” There are analogous query mechanisms for other kinds of nontraditional data. For example, if you want to query across images, you can extract some metadata (size, date, exposure, etc.); represent that metadata in a database column or an object attribute or as markup; and query that metadata. Or you can directly query images according to their similarity to some given image (a query by example) or potentially by giving a verbal description of the image you are looking for. You might compare images along several dimensions – texture, colors, shapes, etc. – so that an example image of a brown cow might bring back other cows, other brown things, or other brown cows. This kind of multimedia search, looking for aspects of audio, video, and image, has enormous potential – think of matching a face on the CCTV screen at an airport, or looking for close matches to the chorus of “My Sweet Lord” to check for possible copyright infringement, or sorting movies by the number of car chases. But direct multimedia search is outside the scope of this book, except for full-text search.

2.4 Chapter Summary

In this chapter we discussed the meaning of query and query language. We described what a query is and what it needs to be able to do, and we introduced SQL, the SQL Query Language, as the gold standard for any query language. We also discussed the different challenges in querying traditional and nontraditional data. Now that we know what XML is and what querying is, the next chapter will discuss the challenges of “querying XML.


1Merriam-Webster Online Dictionary. Available at: http://www.merriam-webster.com.

2Encyclopedia Brittanica Online. Available at: http://www.britannica.com.

3Whatis.com. Available at: http://www.whatis.com.

4Note that the Relational Model is not the only way to organize data in a database. Before the Relational Model was defined, databases were generally implemented using the Hierarchical Model or the Network Model. More recently, some have favored the Object Model of database implementation.

5E. F. Codd, A relational model of data for large shared data banks, Communications of the ACM 13(6), 377–387 (1970). Available at: http://www.acm.org/classics/nov95/toc.html

6For an overview of the Relational Model and an in-depth description of SQL:1999, see: Jim Melton and Alan Simon, SQL:1999 – Understanding Relational Language Components (San Francisco: Morgan Kaufman: 2001).

7The data for each of these examples is taken from Chapter 1. We have added some rows to the data so that the examples make sense.

8For further reading on the benefits of object-relational databases, see Michael Stonebraker, Object-Relational DBMSs, The Next Great Wave (San Francisco: Morgan Kaufmann, 1996).

9There is an obvious analogy here with XML databases. Just a few years ago, some pundits were rash enough to forecast the demise of RDBMSs in favor of XML databases – instead, the major RDBMS vendors are adding XML and XQuery capabilities, so their products might be called ORXDBMSs (object-relational-XML database management systems).

10For an in-depth discussion of the object-relational extensions in SQL:1999, see Jim Melton, Advanced SQL:1999 – Understanding Object-Relational and Other Advanced Features (San Francisco: Morgan Kaufmann, 2002).

11Microsoft plans to make XML the default format in the next version of Microsoft Office (see: Microsoft Office Open XML Formats Overview, June 2005. Available at: http://www.microsoft.com/office/preview/fileoverview.mspx). And Adobe is moving toward an XML format for (some) PDF files (see: The Adobe XML architecture. Available at: http://www.adobe.com/enterprise/xml.html). So there’s a real prospect that XML will open up the most common document formats, making them accessible to XQuery.

12But beware of GIGO – garbage in, garbage out. Manual metadata systems are notorious for producing minimal or useless metadata. When was the last time you filled out the Properties sheet on a Microsoft Word document you were writing?

13Some examples of companies that offer entity-extraction technology: Basis – http://www.basistech.com/entity-extraction/Inxighthttp://www.inxight.com/products/smartdiscovery/ee/ClearForesthttp://www.clearforest.com/Products/Tags.asp

14Adobe XML Architecture, XML Data Packet Specification, Version 2.0 (2003). Available at: http://partners.adobe.com/public/developer/en/xml/xdp_2.0.pdf.

15Microsoft’s website has many articles on Office 2003 XML support, e.g., Dave Beauchemin, Exploring XML in the Microsoft Office System (Redmond, WA: Microsoft, 2003). Available at: http://www.microsoft.com/office/previous/xp/columns/column21.asp.

16Scalable Vector Graphics (SVG) 1.1 (Cambridge, MA: World Wide Web Consortium, 2003). Available at: http://www.w3.org/TR/SVGll/.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.190.239.166