Just like DOM, XPath operates on a tree-based view of an XML document. The XPath tree is built of the same node types used in DOM, except that CDATA sections, entity references, and document type declarations are not directly addressable. The content of CDATA sections and entity references is addressable, however: you can navigate to a text node’s content, but you cannot tell whether that content came from plain text, CDATA, expanded entity references, or some combination thereof. You cannot access document type declarations at all with XPath.
For this discussion, I’ll return to the inventory example from Chapter 5. That example included an inventory database that looked similar to the one in Example 6-1; here I’ve added some additional products.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE inventory SYSTEM "inventory.dtd">
<inventory>
  <!-- Warehouse inventory for Angus Hardware -->
  <date year="2002" month="7" day="6" />
  <items>
    <item quantity="15" productCode="R-273" description="14.4 Volt Cordless Drill" unitCost="189.95" />
    <item quantity="23" productCode="1632S" description="12 Piece Drill Bit Set" unitCost="14.95" />
    <item quantity="10023" productCode="GN0250" description="1/4 inch Galvanized Steel Nails, 1/2 pound box" unitCost="4.95" />
    <item quantity="9887" productCode="GN0375" description="3/8 inch Galvanized Steel Nails, 1/2 pound box" unitCost="189.95" />
    <item quantity="8761" productCode="GN0500" description="1/2 inch Galvanized Steel Nails, 1/2 pound box" unitCost="4.95" />
    <item quantity="3441" productCode="GN0625" description="5/8 inch Galvanized Steel Nails, 1/2 pound box" unitCost="4.95" />
    <item quantity="9987" productCode="GN0750" description="3/4 inch Galvanized Steel Nails, 1/2 pound box" unitCost="4.95" />
    <item quantity="10002" productCode="GN0875" description="7/8 inch Galvanized Steel Nails, 1/2 pound box" unitCost="4.95" />
    <item quantity="596" productCode="GN1000" description="1 inch Galvanized Steel Nails, 1/2 pound box" unitCost="4.95" />
  </items>
</inventory>
To introduce the proper terminology, each part of the XPath expression is called a location step. Each location step is made up of an axis, a node test, and zero or more predicates. Location steps are separated by the slash character (/).
The axis specifies the tree relationship between the nodes selected by the location step and the context node. Many axes have abbreviations which, while very convenient, are not always obvious to someone new to XPath. Table 6-1 shows the axes, their abbreviations, and brief descriptions of their meanings.
Axis | Abbreviation | Meaning |
child | | Contains the immediate children of the context node. |
parent | .. | Contains the immediate parent of the context node. |
self | . | Contains the context node itself. |
attribute | @ | Contains the attributes of the context node, if it is an element. |
ancestor | | Contains the parent of the context node, its parent, and so on, all the way up to the root node. |
ancestor-or-self | | Contains the context node in addition to all the nodes contained in the ancestor axis. |
descendant | | Contains the children of the context node, their children, and so on, all the way down to the lowest level comment, element, processing instruction, and text node. It does not include attributes or namespaces. |
descendant-or-self | // | Contains the context node in addition to all the nodes contained in the descendant axis. |
preceding-sibling | | Contains all children of the context node’s parent node which appear before the context node. |
following-sibling | | Contains all children of the context node’s parent node which appear after the context node. |
preceding | | Contains all nodes which appear before the context node that are not ancestors. |
following | | Contains all nodes which appear after the context node that are not descendants. |
namespace | | Contains the context node’s namespace nodes. |
The node test specifies the type and name of the nodes selected by the location step. Node tests include text( ), which selects the text nodes that are children of the context node; comment( ), which selects all the child nodes of the context node that are comments; processing-instruction( ), which selects all the child nodes of the context node that are processing instructions; and node( ), which selects all children of the context node regardless of type. The child axis is the default for any location step that does not have an explicit axis.
A predicate further refines the set of nodes selected by the location step. Predicates can select a specific element by position, and can call functions like count( ). Predicates always appear in square brackets ([ ]).
The double slash (//) represents the expression /descendant-or-self::node( )/. The XPath query //foo would return all elements named foo anywhere in the document. While this is a very powerful expression, it can also be very inefficient, as it may require the XPath processor to visit every node in the document to see if it is an element named foo. It should be used sparingly, and preferably within controlled contexts.
I’ll show you some of these terms in their proper context as we go along.
If you have an XML document such as the inventory database in Example 6-1, you might wish to select certain nodes from it. For example, you might want to know the date the inventory numbers were recorded. The following XPath expression selects an element named date that is a child of the root node:
/child::date
The double colon (::) separates the axis from the node test. Since child is the default axis, this can also be expressed in the abbreviated syntax:
/date
Every XPath expression has a context node. The context node is the node from which the search begins. In most cases, an XPath implementation allows you to select the node you wish to use as the context node. However, you can explicitly indicate that the search is to begin from the root node of the document by beginning the expression with /. Following the slash, the string date indicates that the expression is to return all nodes that are children of the root node and have the name date.
The XPath recommendation does not require a standard way to set the XPath context node. In .NET, the XmlNode object's SelectNodes( ) method, which I introduced in Chapter 5, sets the context node to the XmlNode instance upon which you call the method.
For the inventory document example, this expression returns nothing, because the date element is a child of inventory, not of the root node. The expression //date would return the element <date year="2002" month="7" day="6" />, but if there are other nodes elsewhere in the tree with the name date, each of them would be returned as well. You can make your search more specific by including only those nodes with the name date that are children of the inventory document element, using this expression:
/child::inventory/child::date
And again, this can be expressed with the abbreviated syntax:
/inventory/date
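As a rough illustration of how the context node shapes a query, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree in place of the .NET classes this chapter relies on. That module implements only a subset of XPath and rejects absolute paths when you search from an element, so the sketch uses paths relative to the inventory element. The trimmed document and the variable names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the inventory document (DOCTYPE and most items omitted).
XML = """<inventory>
  <date year="2002" month="7" day="6" />
  <items>
    <item quantity="15" productCode="R-273" unitCost="189.95" />
    <item quantity="8761" productCode="GN0500" unitCost="4.95" />
  </items>
</inventory>"""

root = ET.fromstring(XML)  # root is the <inventory> element

# The element you call find() on acts as the context node,
# so "date" means "child elements of <inventory> named date".
date = root.find("date")

# "." is the context node itself, so "./date" is the same query.
same = root.find("./date")

# ".//date" searches below the context node at any depth.
anywhere = root.findall(".//date")

print(date.attrib)
```

Because the context node here is the inventory element rather than the root node, the relative path date plays the role that /inventory/date plays in the text.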
In much the same vein, you could navigate to the items element with any of the following expressions; they all select the same node in this document if the context node is the root node:
//child::inventory/child::items
//inventory/items
/inventory/items
inventory/items
The single leading slash (/), as explained previously, indicates that the context node is to be ignored and the search is to be done starting at the root node. The double slash (//) has a slightly different meaning: at any point within the expression, it indicates that the search is to include the current node as well as all its descendants. At the beginning of the expression it searches the root node and all of its descendants, which in this document finds the same inventory element as the single slash. The expression with no leading slash indicates that the search is relative to the context node.
// is actually just an abbreviation for /descendant-or-self::node( )/. So another equivalent to the expressions above would be:
/descendant-or-self::node( )/inventory/child::items
This expansion and replacement of axes really could go on forever.
Once you have retrieved the items element, you can make it the context node for your next XPath expression. You can then return the list of item elements with this expression:
item
You can then iterate through each of these item nodes, doing as you wish with them.
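The following sketch shows one way this two-step navigation might look, again using Python's xml.etree.ElementTree as a stand-in for the .NET API (an assumption of this example, as are the trimmed document and the variable names). The items element is retrieved first and then used as the context node for the relative expression item.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the inventory document; only three items are kept.
XML = """<inventory>
  <date year="2002" month="7" day="6" />
  <items>
    <item quantity="15" productCode="R-273" unitCost="189.95" />
    <item quantity="23" productCode="1632S" unitCost="14.95" />
    <item quantity="8761" productCode="GN0500" unitCost="4.95" />
  </items>
</inventory>"""

root = ET.fromstring(XML)

# Step 1: navigate to <items> and keep it as the new context node.
items = root.find("items")

# Step 2: the relative expression "item" selects its item children.
codes = []
for item in items.findall("item"):
    codes.append(item.get("productCode"))

print(codes)
```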
If you have an item element and wish to gather information about the inventory date, you can use the double period (..), which is an abbreviation for parent::node( ). This location step selects the parent of the current node. So, to get the date element from an item element’s context, you could use this expression:
../../date
The double period can be used anywhere in the expression. For example, you can combine some of the previous forms to return the date element in a fairly inefficient yet entirely legal way. This sort of construct really comes into its own when you start to build XPath expressions dynamically:
//item/../../date
It’s interesting to note that although //item would select all the item elements within the document, //item/../../date returns only the one date element. This is because XPath removes duplicate nodes from the result set.
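The duplicate removal can be observed with a small experiment. This is a sketch only: it uses Python's xml.etree.ElementTree, whose limited XPath subset happens to support the .. step, and it writes the expression relative to the inventory element (with a leading .) because that module does not accept absolute paths on an element. The trimmed document is an assumption of the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the inventory document with three item elements.
XML = """<inventory>
  <date year="2002" month="7" day="6" />
  <items>
    <item quantity="15" productCode="R-273" unitCost="189.95" />
    <item quantity="23" productCode="1632S" unitCost="14.95" />
    <item quantity="8761" productCode="GN0500" unitCost="4.95" />
  </items>
</inventory>"""

root = ET.fromstring(XML)

# Every item element anywhere below the context node.
all_items = root.findall(".//item")

# Climb from each item to items, then to inventory, then down to date.
# Although three items start the walk, the shared parent is kept once.
dates = root.findall(".//item/../../date")

print(len(all_items), len(dates))
```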
You can also select multiple elements at once, with the pipe character (|). The following expression selects both the date and item elements from the document:
//item|//date
XPath defines a special character to select an attribute node. The at sign (@) indicates that the node to select is an attribute; @ is an abbreviation for attribute::. Attributes can be intermingled with other nodes in the XPath expression. Thus, the following expression selects the year attribute of the date element:
//inventory/date/@year
And again, although it is an odd and somewhat inefficient way to do it, you could select the month attribute from any element that has a year attribute with this expression:
//@year/../@month
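Python's xml.etree.ElementTree cannot return attribute nodes from an XPath expression, so the sketch below only approximates //@year/../@month: it uses the supported predicate form [@year] to find every element that carries a year attribute, then reads month from each one in ordinary Python. The trimmed document and the variable names are assumptions of the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the inventory document.
XML = """<inventory>
  <date year="2002" month="7" day="6" />
  <items>
    <item quantity="15" productCode="R-273" unitCost="189.95" />
  </items>
</inventory>"""

root = ET.fromstring(XML)

# Any element below the context node that has a year attribute.
with_year = root.findall(".//*[@year]")

# Read the sibling month attribute from each matching element.
months = [elem.get("month") for elem in with_year]

print(months)
```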
You can also use wildcards for element and attribute names. An asterisk (*) matches all element nodes, and @* matches all attribute nodes. This expression returns all attributes for all elements:
//*/@*
Finally, the node( ) function selects all nodes, of all types.
You may find it helpful to expand the axis abbreviations into their full axes as an aid to learning. For example, //inventory/date/@year is equivalent to /descendant-or-self::node( )/child::inventory/child::date/attribute::year, which, while specific, is not exactly terse.
XPath also defines several functions to select the other types of nodes. The first of these, text( ), selects any text node. The data returned will concatenate all text, whitespace, CDATA, and entity references into a continuous stream of characters, as long as there is no markup separating them:
//text( )
Contrary to the XPath 1.0 recommendation, in .NET’s XPath implementation, a CDATA section interrupts a text node. The CDATA itself and any text following the CDATA will not be returned by text( ).
The comment( ) function selects comments. Each comment is returned as a separate node, even if there is no text or markup between them:
//comment( )
As the name implies, the processing-instruction( ) function selects processing instructions:
//processing-instruction( )
With all the expressions you’ve seen so far, you can move up or down the node hierarchy at will, by inserting the appropriate axis. For example, you can select all the attributes of the parent nodes of any processing instructions with this expression:
//processing-instruction( )/../@*
However, there are times when selecting all the elements or attributes with a particular name is not enough. You may want to find all the elements with a particular attribute value. For this purpose, XPath defines predicates. The following expression selects any item elements that have a productCode attribute whose value is equal to GN0500:
//item[@productCode='GN0500']
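An attribute-equality predicate like this one is within the XPath subset that Python's xml.etree.ElementTree supports, so it can be tried directly. The sketch below is illustrative only: the trimmed document and variable names are assumptions, and the expression is written relative to the inventory element because that module does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the inventory document.
XML = """<inventory>
  <items>
    <item quantity="15" productCode="R-273"
          description="14.4 Volt Cordless Drill" />
    <item quantity="8761" productCode="GN0500"
          description="1/2 inch Galvanized Steel Nails, 1/2 pound box" />
  </items>
</inventory>"""

root = ET.fromstring(XML)

# The predicate keeps only item elements whose productCode is GN0500.
matches = root.findall(".//item[@productCode='GN0500']")

for item in matches:
    print(item.get("description"))
```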
You might also want to find all the items for which fewer than 10,000 units are in stock. The following XPath expression would discover that, and select their description attributes:
//item[@quantity<10000]/@description
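The XPath subset in Python's xml.etree.ElementTree has no relational operators, so the sketch below emulates this query rather than running it: XPath selects the item elements, and the numeric comparison is then done in Python. The explicit float( ) call stands in for the string-to-number conversion that a full XPath processor performs when it compares an attribute value with a number. The trimmed document and the names used are assumptions of the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the inventory document.
XML = """<inventory>
  <items>
    <item quantity="15" productCode="R-273"
          description="14.4 Volt Cordless Drill" />
    <item quantity="10023" productCode="GN0250"
          description="1/4 inch Galvanized Steel Nails, 1/2 pound box" />
    <item quantity="8761" productCode="GN0500"
          description="1/2 inch Galvanized Steel Nails, 1/2 pound box" />
  </items>
</inventory>"""

root = ET.fromstring(XML)

LIMIT = 10000  # threshold taken from the expression in the text

# Attribute values are strings, so convert before comparing.
low_stock = [
    item.get("description")
    for item in root.findall(".//item")
    if float(item.get("quantity")) < LIMIT
]

print(low_stock)
```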
XPath also supports the relational operators <, >, <=, >=, and !=, as well as and and or. Most values are converted automatically to an appropriate numeric or Boolean value, if the operator requires that type.
Although there is a lot more included in the XPath recommendation, there is not room in this volume to list it all. If you’re interested in learning more about XPath, I recommend XML In a Nutshell (O’Reilly). If you want to learn about XPath in an XSLT context, take a look at XSLT (O’Reilly).