Chapter 1 provided sketchy information about using XPath. For the remainder of the book, you’ll get details aplenty. In particular, this chapter covers the most fundamental building blocks of XPath. These are the “things” XPath syntax (covered in succeeding chapters) enables you to manipulate. Chief among these “things” are XPath expressions, the nodes and node-sets returned by those expressions, the context in which an expression is evaluated, and the so-called string-values returned by each type of node.
You’ll learn much more about nodes in this chapter and the rest of the book. But before proceeding into even the most elementary details about using XPath, it’s essential that you understand what, exactly, an XPath processor deals with.
Consider this fairly simple document:
<?xml-stylesheet type="text/xsl" href="battleinfo.xsl"?>
<battleinfo conflict="WW2">
   <name>Guadalcanal</name>
   <!-- Note: Add dates, units, key personnel -->
   <geog general="Pacific Theater">
      <islands>
         <name>Guadalcanal</name>
         <name>Savo Island</name>
         <name>Florida Islands</name>
      </islands>
   </geog>
</battleinfo>
As the knowledgeable human eye — or an XML parser — scans this document from start to finish, it encounters signals that what follows is an element, an attribute, a comment, a processing instruction (PI), whatever. These signals are of course the markup in the document, such as the start and end tags delimiting the elements.
XPath functions at a higher level of abstraction than this simple kind of lexical analysis, though. It doesn’t know anything about a document’s tags and thus can’t communicate anything about them to a downstream application. What it knows about, and knows about intimately, are the nodes that make up the document: the discrete chunks of information encapsulated within and among the markup. Furthermore, it recognizes that these chunks of information bear a relationship to one another, a relationship imposed on them by their physical arrangement within the document (such as the successively deeper nesting of elements within one another). Figure 2-1 illustrates this node-tree view of the above document as seen by XPath.
There are a few things to note about the node tree depicted in Figure 2-1:
First, there’s a hierarchical relationship among the different “things” that make up the tree. Of course, all the nodes are contained by the document itself (represented by the overall figure). Furthermore, many of the nodes have “offshoot” nodes. The battleinfo element sits on top of the outermost name element, the comment, and the geog element (which are all in turn subordinate to battleinfo).
Some discrete portions of the original document contribute to the hierarchical nature of the tree. The elements (solid boxes) and their offspring, namely subordinate elements, text strings (dashed boxes), and the comment, are connected by solid lines representing true hierarchical relationships. Attributes, on the other hand, add nothing to the structure of the node tree (although they do have relationships, depicted with dotted-dashed lines, to the elements that define them). And the xml-stylesheet PI at the very top of the document is connected to nothing at all.
Finally, most subtly yet most importantly, there is not a single scrap of markup in this tree. True enough, the element, attribute, and PI nodes all have names that correspond to bits of the original document’s markup (such as the elements’ start and end tags). But there are no angle brackets here. All that the XPath processor sees is content, stacked inside a tower of invisible boxes. The processor knows what kind of box each thing is, and if applicable it knows the box’s name, but it does not see the box itself.
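To make the "no markup in the tree" point concrete, here is a minimal sketch that walks the sample document with Python's standard xml.dom.minidom module. The DOM is not the XPath data model, only a close cousin, and the outline( ) helper and its labels are my own illustrative names rather than anything defined by XPath.

```python
from xml.dom import minidom, Node

DOC = """<?xml-stylesheet type="text/xsl" href="battleinfo.xsl"?>
<battleinfo conflict="WW2">
   <name>Guadalcanal</name>
   <!-- Note: Add dates, units, key personnel -->
   <geog general="Pacific Theater">
      <islands>
         <name>Guadalcanal</name>
         <name>Savo Island</name>
         <name>Florida Islands</name>
      </islands>
   </geog>
</battleinfo>"""

KINDS = {
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing-instruction",
}

def outline(node, depth=0):
    """Return (depth, kind, label) rows for every node below `node` (hypothetical helper)."""
    rows = []
    for child in node.childNodes:
        kind = KINDS.get(child.nodeType, "other")
        if kind == "text":
            if not child.data.strip():
                continue  # skipped for brevity only; XPath itself keeps whitespace-only text
            label = child.data
        elif kind == "comment":
            label = child.data.strip()
        else:
            label = child.nodeName  # element name or PI target
        rows.append((depth, kind, label))
        if kind == "element":
            # Attributes hang off their element but are not among its children.
            for name, value in sorted(child.attributes.items()):
                rows.append((depth + 1, "attribute", f"{name}={value}"))
            rows.extend(outline(child, depth + 1))
    return rows

rows = outline(minidom.parseString(DOC))
for depth, kind, label in rows:
    print("  " * depth + f"{kind}: {label}")
```

Every row carries a kind and, where applicable, a name or some content, but no angle brackets survive the trip from document to tree.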
If you’ve never worked with XPath before, you may be expecting its syntax to be XML-based. That’s not the case, though. XPath is not an XML vocabulary in its own right. You can’t submit “an XPath” to an XML parser — even a simple well-formedness checker — and expect it to pass muster. That’s because “an XPath” is meant to be used as an attribute value.
Chapter 1 discussed why using XML syntax for general-purpose languages, such as XPath and XPointer, is impractical. As mentioned there, the chief reason might be summed up as: such languages are needed in the context of special-purpose languages, such as XSLT and XLink. Expressing the general-purpose languages as XML would both make them extremely verbose and require the use of namespaces, complicating inordinately what is already complicated enough.
“An XPath”[1] consists of one or more chunks of text, delimited by any of a number of special characters, assembled in any of various formal ways. Each chunk, as well as the assemblage as a whole, is called an XPath expression.
Here’s a handful of examples, by no means comprehensive. (Don’t fret; there are more detailed examples aplenty throughout the rest of the book.)
taxcut
Locates an element, in some relative context, whose name is “taxcut”
/
Locates the document root of some XML instance document
/taxcuts
Locates the root element of an XML instance document, only if that element’s name is “taxcuts”
/taxcuts/taxcut
Locates all child elements of the root taxcuts element whose names are “taxcut”
2001
The number 2001
"2001"
The string “2001”
/taxcuts/taxcut[attribute::year="2001"]
Locates all child elements of the root taxcuts element, as long as those child elements are named “taxcut” and have a year attribute whose value is the string “2001”
/taxcuts/taxcut[@year="2001"]
Abbreviated form of the preceding
2001 mod 100
Calculated remainder after dividing the number 2001 by 100 (that is, the number 1)
/taxcuts/taxcut[@year="2001"]/amount mod 100
Calculated remainder after dividing the indicated amount element’s value by 100
substring-before("ill-considered", "-")
The string “ill”
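The expressions in this list that don't locate anything (the literals, the mod calculation, and the substring-before( ) call) have close analogues in most programming languages. The following is a rough Python sketch of those analogues, not an XPath processor; substring_before is a hypothetical helper written for this illustration.

```python
import math

def substring_before(text, separator):
    """Rough analogue of XPath's substring-before(): "" when there is no match."""
    if not separator:
        return ""
    head, found, _ = text.partition(separator)
    return head if found else ""

number_literal = 2001.0            # XPath numbers are floating point
string_literal = "2001"            # a string, not a number
remainder = math.fmod(2001, 100)   # fmod truncates toward zero, like XPath's mod
prefix = substring_before("ill-considered", "-")

print(remainder, prefix)
```

Note the use of math.fmod rather than Python's % operator: the two agree for positive operands but differ in sign handling for negative ones, and fmod is the closer match to XPath's mod.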
Chapter 3 details both location steps and location paths. To get you started in XPath, here’s a broad outline.
Most XPath expressions, by far, locate a document’s contents or portions thereof. (Expressions such as the number 2001 and the string “2001” are exceptions; they don’t locate anything, you might say, except themselves.) These pieces of content are located by way of one or more location steps — discrete units of XPath “meaning” — chained together, usually, into location paths.
This XPath expression from the above list:
/taxcuts/taxcut
consists of two location steps. The first locates the taxcuts child of the document root (that is, it locates the root element); the second locates all children of the preceding location step whose names are “taxcut.” Taken together, these two location steps make up a complete location path.
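If you want to watch a location path at work before Chapter 3, the sketch below uses Python's standard xml.etree.ElementTree module, which implements only a small, abbreviated-syntax subset of XPath. The sample taxcuts document is invented for this illustration. Because ElementTree starts you at the root element rather than at the root node, the first step (/taxcuts) is already taken once the document is parsed.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document, made up for this example.
DOC = """<taxcuts>
  <taxcut year="2001"><amount>300</amount></taxcut>
  <taxcut year="2003"><amount>400</amount></taxcut>
  <rebate year="2001"/>
</taxcuts>"""

root = ET.fromstring(DOC)                 # step 1: the taxcuts root element

# Step 2 of /taxcuts/taxcut: all taxcut children of the root element.
all_cuts = root.findall("taxcut")

# /taxcuts/taxcut[@year="2001"]: the same step, narrowed by a predicate.
cuts_2001 = root.findall("taxcut[@year='2001']")

print(len(all_cuts), [cut.get("year") for cut in cuts_2001])
```

The rebate element is never selected, even though it also has a year attribute of 2001, because its name fails the second location step.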
As you can see from the previous examples, an XPath expression can be said to consist of various components: tokens and delimiters.
A token, in XPath as elsewhere in the XML world, is a simple, discrete string of Unicode characters. Individual characters within a token are not themselves considered tokens. If an XPath expression is analogous to a chemical molecule, the tokens of which it’s composed are the atoms. (And the individual characters, I guess, are the sub-atomic particles.)
If quotation marks surround the token, it’s assumed to be a string. If no quotation marks adorn the token, an XPath-smart application assumes that the token represents a node name.[2]
I’ll have more to say about nodes and their names in a moment and much more to say about them throughout the rest of the book. For now, though, consider the first example listed above. The bare token taxcut is the name of a node. If I had put it in quotation marks, like "taxcut", the XPath expression wouldn’t necessarily refer to anything in a particular document; it would simply refer to a string composed of the letters t, a, x, c, u, and t: almost certainly not what you want at all.
As a special case, a node name can also be represented with an asterisk (*). This serves as a wildcard character, matching all nodes regardless of their name. The expression taxcut/* locates all elements that are children of a taxcut element.
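A quick way to see the wildcard in action is the sketch below, again using the XPath subset in Python's standard xml.etree.ElementTree module. The document and its sponsor element are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only the element names matter here.
DOC = """<taxcuts>
  <taxcut year="2001"><amount>300</amount><sponsor>Smith</sponsor></taxcut>
  <taxcut year="2003"><amount>400</amount></taxcut>
</taxcuts>"""

root = ET.fromstring(DOC)

# taxcut/* : every element child of every taxcut element, whatever its name.
children = root.findall("taxcut/*")
tags = [child.tag for child in children]

print(tags)
```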
Tokens in an XPath expression are set off from one another using single-character delimiters, or pairs of them. Aside from quotation marks, these delimiters include:
/
A forward slash separates a location step from the one that follows it. While I introduced location steps briefly above, Chapter 3 will discuss them at length.
[ and ]
Square brackets set off a predicate from the preceding portion of a location step. Again, detailed discussion of predicates is coming in Chapter 3. For now, understand that a predicate tests the expression preceding it for a true or false value. If true, the indicated node in the tree is selected; if false, it isn’t.
=, !=, <, >, <=, and >=
These Boolean “delimiters” are used in a predicate to establish the true or false value of the test. Note that when used in an XML document, the markup-significant < character must appear in its escaped form, &lt;, to comply with XML’s well-formedness constraints, even when used in attribute values; the > character may appear literally there, although it is commonly escaped as &gt; for symmetry. (For instance, to use the Boolean less-than-or-equal-to test, you must code the XPath expression as &lt;=.) While XPath itself isn’t expressed as an XML vocabulary, the documents in which XPath expressions most often appear are XML documents; therefore, well-formedness will haunt you in XPath just as elsewhere in the XML world.[3]
::
A double colon separates the name of an axis type from the name of a specific node (or set of nodes). Axes (more in Chapter 3) in XPath, as in plane and solid geometry, indicate some orientation within a space. In an XPath expression, an axis “turns the view” from a given starting point in the document. For instance, the attribute axis (abbreviated @) looks only at attributes of some element or set of elements.
//, @, ., and ..
Each of these — the double slash, at sign, period, and double period — is an abbreviated or shortcut form of an axis or location step. Respectively, these symbols represent the concepts of descendant-or-self, attribute, self, and parent (covered fully in Chapter 3).
|
A pipe/vertical bar in an XPath expression functions as a union operator on node-sets. This lets you chain together multiple complete expressions into compound location paths. Compound location paths are covered at the end of Chapter 3.
( and )
Pairs of parentheses in XPath expressions, as in most other computer-language contexts, serve two purposes. They can be used for grouping subexpressions, particularly in cases where the ungrouped form would introduce ambiguities, and they can be used to set off the name of an XPath function from its argument list. Details on XPath functions appear in Chapter 4.
+, -, *, div, and mod
These five “delimiters” actually function as numeric operators: ways of combining numeric values to calculate some other value. Numeric operators are also covered in Chapter 4. Note that the asterisk can be used as either a numeric operator or a wildcard character, depending on the context in which it appears. The expression tax*income multiplies the values of the tax and income elements and returns the result; it does not locate all elements whose names start with the string “tax” and end with the string “income.”
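The well-formedness point made above for the comparison operators is easy to check with any XML parser. This sketch uses Python's standard xml.etree.ElementTree module; the check element and its test attribute are hypothetical stand-ins for whatever XML vocabulary hosts your XPath expressions.

```python
import xml.etree.ElementTree as ET

# Escaped less-than: the parser hands the XPath processor a plain "<=".
escaped = ET.fromstring('<check test="amount &lt;= 100"/>')

# A literal greater-than is accepted by this parser inside an attribute value.
literal_gt = ET.fromstring('<check test="amount >= 100"/>')

# A literal less-than inside an attribute value is not well-formed XML.
try:
    ET.fromstring('<check test="amount <= 100"/>')
    raw_lt_parsed = True
except ET.ParseError:
    raw_lt_parsed = False

print(escaped.get("test"), "|", literal_gt.get("test"), "|", raw_lt_parsed)
```

By the time the attribute value reaches XPath, the entity reference is gone; the escaping exists purely to get the expression past the XML parser.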
When not appearing within a string, whitespace can in some instances delimit tokens (and even other delimiters) for legibility, without changing the meaning of an expression. For instance, the two predicates [@year="2001"] and [@year = "2001"] are functionally identical, despite the presence in the second case of blank spaces before and after the =. Because the rules for when you can and can’t use whitespace vary depending on context, I’ll cover them in various places throughout the book.
While the rules for valid combinations of tokens and delimiters aren’t spelled out explicitly anywhere, they follow the rules of common sense. (Whether the sense is in fact common depends a little on how comfortable you are with the concepts of location steps and location paths.)
For instance, the following is a syntactically illegitimate XPath expression; it also, if you think a little about it, doesn’t make practical sense:
book/
See the problem? First, for those of you who simply want to follow the rules without thinking about them, you can simply accept as a given that the / (unless used by itself) must be used as a delimiter between location steps; with no subsequent location step to the right, it’s not separating book from anything.
Second, there’s a more, well, let’s call it a more philosophical problem. What exactly would the above expression be meant to say? “Locate a child of the book element which....” Which what? It’s like a sentence fragment.
Note the difference here between XPath expressions and their counterparts in some other “navigational” languages, such as Unix directory commands and URIs. In these other contexts, a trailing slash might mean “all children of the present context” (such as a directory) or “the default child of the present context” (such as a web document named index.html or default.html). In XPath, few of these matters are implicit. If you want to get all children of the current context, follow the slash with something, such as an asterisk wildcard (to get all named children), as in book/*. Chapter 3 describes other approaches, particularly the use of the node( ) node test.
I’ll cover these kinds of common-sense rules where appropriate. (See Chapter 3, especially.)
A careful reading of the previous material about XPath expressions should reveal that XPath is capable of processing four data types: string, numeric, Boolean, and nodes (or node-sets).
The first three data types I’ll address in this section. Nodes and node-sets are easily the most important single XPath data type, so I’ve relegated them to a complete section in their own right, following this one.
You can find two kinds of strings, explicit and implicit, in nearly any XPath expression. Explicit (or literal) strings, of course, are strings of characters delimited by quotation marks. Now, don’t get confused here. As I’ve said, XPath expressions themselves appear as attribute values in XML documents. Therefore, an expression as a whole will be contained in quotation marks. Within that expression, any literal strings must be contained in embedded quotation marks. If the expression as a whole is contained in double quotation marks, ", then a string within it must be enclosed in single quotation marks or apostrophes: '. If you prefer to enclose attribute values in single quotes, the embedded string(s) must appear in double quotes.
This nesting of literal quotation marks and apostrophes (or vice versa) is unnecessary, strictly speaking. If you prefer, you can escape the literals using their entity representations. That is, the expressions "a string" and &quot;a string&quot; are functionally identical. The former is simply more convenient and legible.
For example, in XSLT stylesheets, one of the most common attributes is select, applied to the xsl:value-of element (which is empty) and others. The value of this attribute is an XPath expression. So you might see code such as the following:
<xsl:value-of select="fallacy[type='pathetic']"/>
If the string “pathetic” were not enclosed in quotation marks, of course, it would be considered a node name rather than a string. (This might make sense in some contexts, but even in those contexts, it would almost certainly produce quite different results from the quoted form.) Note that the kind of quotation marks used in this example alternates between single and double as the quoted matter is nested successively deeper.
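Here is a small sketch, using Python's standard xml.etree.ElementTree module, of the two ways to embed a string literal in a select attribute: nested apostrophes and the entity form. The fallacies wrapper document is an assumption invented for the example, and ElementTree's find methods support only a subset of XPath.

```python
import xml.etree.ElementTree as ET

XSL = "http://www.w3.org/1999/XSL/Transform"

# Literal delimited by apostrophes nested inside the double-quoted attribute.
nested = ET.fromstring(
    f"""<xsl:value-of xmlns:xsl="{XSL}" select="fallacy[type='pathetic']"/>"""
)
# The same literal delimited by escaped double quotes.
escaped = ET.fromstring(
    f"""<xsl:value-of xmlns:xsl="{XSL}" select="fallacy[type=&quot;pathetic&quot;]"/>"""
)

# Hypothetical data document to evaluate both expressions against.
data = ET.fromstring(
    "<fallacies>"
    "<fallacy><type>logical</type></fallacy>"
    "<fallacy><type>pathetic</type></fallacy>"
    "</fallacies>"
)

hits_nested = data.findall(nested.get("select"))
hits_escaped = data.findall(escaped.get("select"))

print(nested.get("select"), escaped.get("select"), hits_nested == hits_escaped)
```

The two attribute values differ only in which quotation character delimits the literal, and both select the same single fallacy element.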
Explicitly quoted strings aside, XPath also makes very heavy use of what might be called implicit strings. They might be called that, that is, except there’s already an official term for them: string-values. I will have more to say about string-values later in this chapter. For now, a simple example should suffice.
Consider the following fragment of an XML document:
<type>logical</type>
<type>pathetic</type>
Each element in an XML document has a string-value: the concatenated value of all text contained by that element’s start and end tags. Therefore, the first type element here has a string-value of logical; the second, pathetic.
An XPath expression in a predicate such as:
type='logical'
would be evaluated for the two elements, respectively, as:
'logical'='logical'
'pathetic'='logical'
That is, for the first type element the predicate would return the value true; for the second, false.
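The comparison just described can be mimicked in a few lines of Python with the standard xml.etree.ElementTree module. This is only a sketch: string_value is a hypothetical helper that approximates an element's string-value by concatenating its descendant text, and the fallacies wrapper element is invented so that the fragment parses as a document.

```python
import xml.etree.ElementTree as ET

def string_value(element):
    """Approximate an element's string-value: all descendant text, concatenated."""
    return "".join(element.itertext())

fragment = ET.fromstring(
    "<fallacies><type>logical</type><type>pathetic</type></fallacies>"
)

values = [string_value(t) for t in fragment.findall("type")]
matches = [value == "logical" for value in values]   # type='logical', per element

# Text nested inside child elements still counts toward the string-value.
nested = ET.fromstring("<type>patheti<em>c</em> fallacy</type>")

print(values, matches, string_value(nested))
```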
There’s no special magic here. A numeric value in XPath terms is just a number; it can be operated on with arithmetic, and the result of that operation is itself a number. (XPath provides various facilities for converting numeric values to strings and vice versa. Detailed coverage of these facilities can be found in Chapter 4.) Formally, XPath numbers are all assumed to be floating-point numbers even when their explicit representation is as integers.
While XPath assumes all numbers to be of floating-point type, you cannot represent literal numbers in XPath using scientific notation. For example, many languages allow you to represent the number 1960 as 1.96E3 (that is, 1.96 times 10 to the 3rd power); such a value in XPath is not recognized as a legitimate number.
Although the XPath specification does not define “numeric-values” for nodes analogous to their string-values, XPath-aware applications can treat as numeric any string-value that can be “understood” as numeric. Thus, given this XML code fragment:
<page_ref>23</page_ref>
you can construct an XPath expression such as:
page_ref + 10
This would be understood as 23 (the numeric form of the page_ref element’s string-value) plus 10, or 33.
The XPath specification also defines a special value, NaN, for simple “Is this value a number?” tests. (“NaN” stands for “not a number.”) While the spec repeatedly refers to something called NaN, it doesn’t show you how to use it except as a string (returned by the XPath string( ) function, as it happens). If you wanted to locate only those year elements which had legitimately numeric values, you could use an XPath expression something like this:
string(number(year)) != "NaN"
This explicitly attempts to convert the string-value of the year element to a number, then converts the result of that attempt to a string and compares it to the string “NaN.”[4] Only those year elements for which those two values are not equal (that is, only those year elements whose string-values are not “not a number”) will pass.
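To make the conversion rules concrete, here is a rough Python sketch of how a string-value becomes a number or NaN. It assumes the XPath 1.0 rule that a numeric string is optional whitespace, an optional minus sign, and plain decimal digits (no exponent); xpath_number and is_numeric are hypothetical helper names, not XPath functions.

```python
import math
import re

# Optional XML whitespace, optional minus sign, decimal digits; no exponent allowed.
_XPATH_NUMBER = re.compile(r"[ \t\r\n]*-?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)[ \t\r\n]*")

def xpath_number(text):
    """Rough analogue of number(): a float, or NaN if the string isn't numeric."""
    return float(text) if _XPATH_NUMBER.fullmatch(text) else math.nan

def is_numeric(text):
    """Rough analogue of the test string(number(x)) != "NaN"."""
    return not math.isnan(xpath_number(text))

page_total = xpath_number("23") + 10          # page_ref + 10

years = ["2001", " 1960 ", "1.96E3", "MMI", ""]
numeric_years = [year for year in years if is_numeric(year)]

print(page_total, numeric_years)
```

Notice that 1.96E3 fails the test: Python's own float( ) would happily accept it, which is exactly why the sketch checks the string against a pattern first.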
The string( ) function, covered at length in Chapter 4, is extremely important in XPath. That’s not because it’s used that much in code (in my experience it isn’t used much at all); rather, its importance is due to the XPath spec’s being rife with phrases such as “...as if converted to a string using the string( ) function.” As a practical matter, the string( ) function’s use is implicit in many situations. From a certain standpoint, you could almost say that all an XML document’s text content is understood by an XPath-aware application “as if converted to a string using the string( ) function.”
As elsewhere, in XPath a Boolean value is one that equals either true or false. You can convert a Boolean value to the string or numeric data types, using XPath functions. The string forms of the values true and false are (unsurprisingly) the strings “true” and “false”; their numeric counterparts are 1 and 0, respectively.
Probably the single most useful application of XPath Booleans is in the predicate portion of a location step. As I mentioned earlier, the predicate tests some candidate node to see if it fulfills some condition expressed as a Boolean true or false. Thus:
concerto[@key="F"]
locates a concerto element only if its key attribute has a value of "F".
Importantly, as you will see in Chapter 3, the predicate’s true or false value can also test for the simple existence of a particular node. If the node exists, the Boolean value of the predicate is true; if not, false. So:
concerto[@key]
locates a concerto element only if it has any key attribute at all.
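Both kinds of predicate, the value test and the existence test, happen to fall within the XPath subset supported by Python's standard xml.etree.ElementTree module, so they can be sketched as follows. The concertos document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two concertos with a key attribute, one without.
DOC = """<concertos>
  <concerto key="F">Autumn</concerto>
  <concerto key="D">Untitled</concerto>
  <concerto>Unfinished</concerto>
</concertos>"""

root = ET.fromstring(DOC)

in_f = root.findall('concerto[@key="F"]')   # predicate true only when key is "F"
keyed = root.findall("concerto[@key]")      # predicate true when any key exists

print([c.text for c in in_f], [c.text for c in keyed])
```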
The fourth and most important data type handled by XPath is the node-set data type.
Let’s look first at nodes themselves. A node is any discrete logical something able to be located by an XPath location step. Every element in a document constitutes a node, as does every attribute, PI, and so on.
Each node in a document has various properties. I’ve discussed one of these properties briefly already — the string-value — and will provide more information about it at the end of this chapter. The others are its name, its sequence within the document, and its “family relationships” with other nodes.
Most (but not all) nodes have names. To understand node names, you need to understand three terms:
The term qualified name, almost always contracted to “QName,” is taken straight from the W3C “Namespaces in XML” spec, at http://www.w3.org/TR/REC-xml-names. The QName of a node, in general, is the identifier for the node as it actually appears in an instance document, including any namespace prefix. For example, an element whose start tag is <concerto> has a QName of “concerto”; if the start tag were <mml:concerto>, the QName would be “mml:concerto.”
The local-name of a node is its QName, sans any namespace prefix. If an element’s QName is “mml:concerto,” its local-name is “concerto.” If there’s no namespace in effect for a given node, its QName and local-name are identical.
If the node is associated with a particular namespace, its expanded-name is a pair, consisting of the URI associated with that namespace and the local-name. Because the expanded-name doesn’t consider the namespace prefix at all, two elements, for example, can have the same expanded-name even if their QNames are different, as long as both their associated namespace URIs (possibly null) and their local-names are identical. For more information, see Expanded but Elusive later in this chapter.
These three kinds of name conform to common sense in most cases, for most nodes, but can be surprising in others. When covering node types, below, I’ll tell you how to determine the name of a node of a given type.
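The claim that two elements with different QNames can share an expanded-name is easy to demonstrate. Python's standard xml.etree.ElementTree module discards prefixes and stores each element's name as {URI}local-name, which is the expanded-name in all but notation. In this sketch the namespace URI, the m2 prefix, and the expanded_name helper are assumptions invented for the example.

```python
import xml.etree.ElementTree as ET

# Two prefixes bound to the same (hypothetical) namespace URI.
DOC = """<works xmlns:mml="http://example.com/mml" xmlns:m2="http://example.com/mml">
  <mml:concerto/>
  <m2:concerto/>
  <concerto/>
</works>"""

def expanded_name(tag):
    """Split ElementTree's '{uri}local' tag into a (namespace URI, local-name) pair."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return (uri, local)
    return (None, tag)   # no namespace in effect

root = ET.fromstring(DOC)
names = [expanded_name(child.tag) for child in root]

print(names)
```

The first two pairs are identical despite the differing prefixes, while the third, unprefixed concerto has a null namespace URI and so a different expanded-name.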
Nodes in a document are positioned within the document before or after other nodes. Take a look at this sample document:
<?xml-stylesheet type="text/xsl" href="invoice.xsl"?>
<statement acct="112233">
   <history>
      <credits>
         <payment date="2001-09-09" curr="EU">13.99</payment>
         <adjustment date="2001-09-30" curr="USD">12.64</adjustment>
      </credits>
      <debits>
         <fin_chg date="2001-09-09" curr="USD">1.98</fin_chg>
      </debits>
   </history>
   <current>
      <!-- No current charges for this customer? -->
   </current>
</statement>
If you were an XML parser reading this document from start to finish, you’d be following normal document order. The xml-stylesheet PI comes before any of the elements in the document, the history element precedes the current element, the fin_chg element precedes the comment contained by the current element, and so on. Also note that XPath considers the attributes of a given element to come before that element’s children and other descendants.
This is all pretty much common sense. Be careful when dealing with attributes, though: XPath considers an element’s attributes to be in no particular document order at all. In the above document, whether the various date attributes are “before” the corresponding curr attributes is entirely XPath application dependent. As a practical matter, most XPath applications will probably be indexing attributes alphabetically, by their names, so each curr will precede its date counterpart. But you cannot absolutely count on this behavior.
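For the elements, at least, document order is easy to observe. The sketch below uses Python's standard xml.etree.ElementTree module, whose iter( ) method walks elements in the order their start tags appear. Be aware that this parser, by default, discards the comment and the PI, so only element order is shown; that is a limitation of the sketch, not of XPath.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml-stylesheet type="text/xsl" href="invoice.xsl"?>
<statement acct="112233">
   <history>
      <credits>
         <payment date="2001-09-09" curr="EU">13.99</payment>
         <adjustment date="2001-09-30" curr="USD">12.64</adjustment>
      </credits>
      <debits>
         <fin_chg date="2001-09-09" curr="USD">1.98</fin_chg>
      </debits>
   </history>
   <current>
      <!-- No current charges for this customer? -->
   </current>
</statement>"""

root = ET.fromstring(DOC)

# iter() visits the root element and then every descendant element, depth first.
order = [element.tag for element in root.iter()]

print(order)
```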
As you’ll see in Chapter 3, under the discussion of axes, it’s also possible to access nodes in reverse document order.
XML’s strict enforcement of document structure, even under simple well-formedness constraints, ensures that nodes don’t just have a simple document order — even the “nodes” in a comma-separated values (or other plain text) file do that much — but also a set of more complex relationships to one another. Some nodes are parents of others (which are, in turn, children of their parents), and nodes may have siblings, ancestors, and so on.
Because these family relationships are codified in the concept of XPath axes, I’ll defer further discussion of them until Chapter 3.
XPath doesn’t for the most part deal in nodes, but in node-sets. A node-set is simply a collection of nodes, related to one another in some arbitrary way by means of an XPath location step (or full location path). In some cases, sure, a node-set might consist of a single node. But in most cases — especially when the location step is unqualified by a predicate — this is almost an accident, an artifact of the XML instance being navigated via XPath.
Here’s a simple XML document:
<publications>
   <book>...</book>
   <book>...</book>
   <book>...</book>
   <magazine>...</magazine>
</publications>
This location path:
/publications/book
returns a node-set consisting of three book elements. This location path:
/publications/magazine
returns a single magazine node. Technically, though, there’s nothing inherent in this location path that forces only a single node to be located. This document just happens to have a single magazine element, and as a result, the location path locates a node-set that just happens in this case to consist of a single node.
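The "a single node is still a node-set" idea shows up directly in most XPath-ish APIs. In this sketch, which uses Python's standard xml.etree.ElementTree module, findall( ) always returns a list, whether it holds three elements or one. (A Python list is ordered, unlike a true node-set, so treat this as an approximation.)

```python
import xml.etree.ElementTree as ET

DOC = """<publications>
   <book>...</book>
   <book>...</book>
   <book>...</book>
   <magazine>...</magazine>
</publications>"""

root = ET.fromstring(DOC)          # the publications root element

books = root.findall("book")           # relative form of /publications/book
magazines = root.findall("magazine")   # relative form of /publications/magazine

print(len(books), len(magazines), type(magazines).__name__)
```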
This concept of node-sets returned by XPath radically departs from the more familiar counterparts in HTML hyperlinking. Under HTML, a hyperlink “gets” an entire document. This is true even if the given HTML link uses a fragment identifier, such as #top or #section1. (The whole document is still retrieved; it’s simply positioned within the browser window in a particular way.) Using XPath, though, what you’re manipulating is in most cases truly not the entire target document, but one or more discrete portions of it. In this sense, XPath isn’t a “pointing-to” tool; it’s an extraction tool.[5]
Also worth noting at this point is that the term node-set carries some implicit baggage of meaning: like a set in mathematical terms, a node-set contains no duplicate nodes (although some may have duplicate string-values) and is intrinsically unordered. When you use XPath to locate as a node-set all elements in a document, there’s no guarantee that you’ll get the members of the node-set back in any particular sequence.
The kinds of node(-set)s retrievable by XPath cover, in effect, any kind of content imaginable: not just elements and attributes, but PIs, comments, and anything else you might find in an XML document. Let’s take a look at these seven node types.
Conspicuously missing from the following list of “any kind of content imaginable” are entity references. There’s no such thing as an “entity reference node,” for example. Why not? Because by the time a document’s contents are scanned by an XPath-aware application, they’ve already been processed by a lower-level application: the XML parser itself. All entity substitutions have already been made. By the same token, XPath can’t locate a document’s XML declaration or DTD, can’t return to your application any of the contents of an internal DTD subset, and can’t access (for example) tags instead of elements. XPath, in short, can’t “read” a document lexically; it can only “read” it logically.
(See the short section, Section 2.4.3.8 later in this chapter, for a comparison of XPath node types with what the Infoset refers to as “information items.”)
Every XML document has one and only one root node. This is the logical something that contains the entire document — not only the root element and all its contents, but also any whitespace, comments, or PIs that precede and follow the root element’s start and end tags. This “something” is analogous to a physical file, but there may be no precise physical file to which the root node refers (especially in the case of XML documents generated or assembled on the fly and not saved in some persistent form).
In a location path, the root node is represented by a leading / (forward slash) character. Any location path that starts with / is an absolute location path, instructing the XPath-aware application, in effect, to start at the very top of the document before considering the location steps (if any) that follow. The root node does not have an expanded-name. Its local-name is an empty string.
Each element node in the document is composed of the element’s start and end tags and everything in between. (Thus, when you retrieve a document’s root element, you’re retrieving everything in the document except any comments and PIs that precede or follow it.) Consider a simple code fragment:
<year>
   <month monthnum="4">April</month>
   <month monthnum="8">August</month>
   <month monthnum="12">December</month>
   <month monthnum="2">February</month>
   <month monthnum="1">January</month>
   <month monthnum="3">March</month>
   <month monthnum="5">May</month>
   <month monthnum="6">June</month>
   <month monthnum="7">July</month>
   <month monthnum="11">November</month>
   <month monthnum="10">October</month>
   <month monthnum="9">September</month>
</year>
This location path (which says, “Locate all month children of the root year element whose monthnum attributes have the value 3”):
/year/month[@monthnum="3"]
selects the sixth month element in the fragment; that is, the element whose contents (the string “March”) are bounded by the <month monthnum="3"> start tag and the corresponding </month> end tag. To emphasize, and to repeat a point made early in this chapter: while the physical representation of the element is bounded by its start and end tags, XPath doesn’t have any understanding at all of tags or any other markup. It just gets a particular invisible box corresponding to this physical representation and holding its contents. Importantly, though, it selects the element as a single node with various properties and subordinate objects (a name, a string-value, an attribute with its value).
Note especially that this example does not locate the third month element. It selects all month elements with the indicated monthnum attribute value.
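The difference between "the month whose monthnum is 3" and "the third month" can be checked with a short sketch using Python's standard xml.etree.ElementTree module. The position variable is my own addition, computed in plain Python to show where the selected element sits among its siblings.

```python
import xml.etree.ElementTree as ET

DOC = """<year>
   <month monthnum="4">April</month>
   <month monthnum="8">August</month>
   <month monthnum="12">December</month>
   <month monthnum="2">February</month>
   <month monthnum="1">January</month>
   <month monthnum="3">March</month>
   <month monthnum="5">May</month>
   <month monthnum="6">June</month>
   <month monthnum="7">July</month>
   <month monthnum="11">November</month>
   <month monthnum="10">October</month>
   <month monthnum="9">September</month>
</year>"""

root = ET.fromstring(DOC)                     # the year root element

# Relative form of /year/month[@monthnum="3"]
months = root.findall("month[@monthnum='3']")

# 1-based position of the selected element among the year element's children.
position = list(root).index(months[0]) + 1

print(len(months), months[0].text, position)
```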
You sometimes must take care, when selecting element nodes, not to be confused by the presence of “invisible whitespace” in their string-values.
Yes, true: all whitespace is invisible. (That’s why it’s called whitespace, right?) But the physical appearance of XML documents can trick you into thinking that some whitespace “doesn’t count,” even though that’s not necessarily true. For instance, consider Figure 2-2, depicting a greatly simplified version of a document in the same vocabulary we’ve been using in this section.
In this figure, as you can see, there’s no whitespace in any of the document’s content, only within element start tags (and not always there). While many real-world XML documents (especially those that are machine generated) appear this way, it’s just as likely that the documents you’ll be excavating with XPath will look like Figure 2-3.
The human eye tends to ignore the whitespace-only blocks of text (represented with gray blocks in the figure) in a document like this one, discarding them as insignificant to the document’s meaning. But XML parsers, bound by the XML spec’s “all text counts” constraint, are not free to ignore these scraps of whitespace. (Some parsers may flag the whitespace as potentially “insignificant,” leaving to some higher-order application the task of ignoring it or not.) So consider now the effect of an XPath expression such as the following, when applied to the document in Figure 2-3:
/year
This location path doesn’t return just the year element node, the month element node and its attribute. It also returns:
Some blank spaces, a newline, and some more blank spaces preceding the month element
A newline following the month element
Whether this will present you with a problem depends on your specific need. If it is a problem, there’s an XPath function, normalize-space( ) (covered in Chapter 4), that trims all leading and trailing whitespace from a string such as an element’s string-value, and collapses any interior runs of whitespace into single spaces.
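To make the "invisible whitespace" visible, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The document is a hypothetical stand-in for Figure 2-3 (the monthnum value and the month text are my assumptions), and normalize_space is a hand-rolled approximation of the XPath function, not the real thing; ElementTree's limited XPath support does not include it.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pretty-printed document in the spirit of Figure 2-3.
PRETTY = '<year>\n    <month monthnum="3">March</month>\n</year>'

def normalize_space(text):
    """Approximate XPath normalize-space(): collapse whitespace runs, then trim."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")

year = ET.fromstring(PRETTY)

# An element's string-value is the text of all its descendant text nodes,
# whitespace-only ones included.
string_value = "".join(year.itertext())

print(repr(string_value))                   # shows the newlines and blanks
print(repr(normalize_space(string_value)))  # shows the cleaned-up text
```

The repr( ) calls are there only so the newlines and blank spaces show up when printed.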
In XPath, as in many other XML-related areas, dealing with whitespace can induce either euphoria or migraines. In addition to the normalize-space( ) XPath function covered in Chapter 4, you should consider the (default or explicit) behavior of XML’s own built-in xml:space attribute, and, depending on your application’s needs, the effects of the XSLT xsl:strip-space and xsl:preserve-space elements, as well as the preserveWhiteSpace property of the MSXML Document object (if you’re working in a Microsoft scripting environment).
The local-name of an element node is the name of the element type (that is, its generic identifier (GI), as it appears in the element’s start and optional end tags). Thus, its expanded-name equals either its local-name (if there isn’t a namespace in effect for that element) or its associated namespace URI paired with the local-name. Consider this code fragment:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="/"> <html> ... </html> </xsl:template> </xsl:stylesheet>
All elements in this document whose names carry the xsl: prefix are in the namespace associated with the URI “http://www.w3.org/1999/XSL/Transform.” Thus, the expanded-name of the xsl:stylesheet element consists of that URI paired with the local-name, “stylesheet.”
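As a rough illustration of expanded-names, Python's standard ElementTree parser happens to store each element's name as the namespace URI in braces followed by the local-name. The expanded_name helper below is my own hypothetical function; it only splits that stored form apart.

```python
import xml.etree.ElementTree as ET

STYLESHEET = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html> ... </html>
  </xsl:template>
</xsl:stylesheet>"""

def expanded_name(element):
    """Split ElementTree's '{uri}local' tag into (namespace URI, local-name)."""
    tag = element.tag
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag  # no namespace in effect for this element

root = ET.fromstring(STYLESHEET)
html = root.find(".//html")  # unprefixed, and no default namespace declared

print(expanded_name(root))
print(expanded_name(html))
```

Note that the xsl: prefix itself never appears in the result; only the URI and the local-name survive parsing.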
Attributes, in a certain sense, “belong to” the elements in which they appear, and in the same sense, they might be thought to adopt the namespace-related properties of those elements. For instance:
<xsl:template match="/" />
Logically, you might conclude that the match attribute is “in” the same namespace as the xsl:template element (it does, after all, belong to the same XML vocabulary) and that it, therefore, has something like an implied namespace prefix. This isn’t the case, though. An attribute’s QName, local-name, namespace URI, and hence, expanded-name are all determined solely on the basis of the way the attribute is represented in the XML source document. Attributes such as match in the above example (with no explicit prefix) have a null namespace URI. That is, unprefixed attributes are in no namespace, including the default one; thus, their QName and local-name are identical.
Note that namespace declarations, which look like attributes (and indeed are attributes, according to the XML 1.0 and “Namespaces in XML” Recommendations), are not considered the same as other attributes when you’re using XPath. As one example, the start tag of a typical xsl:stylesheet root element in a typical XSLT stylesheet might look something like this:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://www.w3.org/1999/xhtml">
A location path intended to locate all this element’s attributes might be:
/xsl:stylesheet/@*
(As a reminder, the wildcard asterisk here retrieves all attribute nodes, regardless of their names.) In fact, this location path locates only the version attribute. The xmlns:xsl and xmlns attributes, being namespace declarations instead of “normal” attributes, are not locatable as attribute nodes.
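You can see much the same split between attributes and namespace declarations outside of XPath. In this small sketch, Python's standard ElementTree parser (not an XPath processor, so treat the parallel as suggestive only) reports the version and match attributes under their plain, unqualified names and leaves the xmlns declarations out of the attribute list altogether. The xsl:template child is added here just to have an unprefixed attribute to look at.

```python
import xml.etree.ElementTree as ET

START_TAG_DOC = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml">
  <xsl:template match="/"/>
</xsl:stylesheet>"""

root = ET.fromstring(START_TAG_DOC)
template = root[0]

# Attribute names carry no '{uri}' part: unprefixed attributes are in no
# namespace, even though a default namespace is declared.
print(dict(root.attrib))
print(dict(template.attrib))
```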
If the document referred to by the XPath expression is validated against a DTD, it may contain more attributes than are present explicitly — visibly — in the document itself. That’s because the DTD may declare some attributes with default values in the event the document author has not supplied those attributes explicitly. Always remember that the “document” to which you’re referring with XPath is the source document as parsed, which may be only more or less the source document that you “see” when reading it by eye.
Processing instructions, by definition, stand outside the vocabulary of an XML document in which they appear. Nonetheless, they do have names in an XPath sense: the name of a PI is its target (the identifier following the opening <? delimiter). For this PI:
<?xml-stylesheet type="text/css" href="mystyle.css"?>
the QName and local-name are both xml-stylesheet. However, because a PI isn’t subject to namespace declarations anywhere in a document (PIs, like unprefixed attributes, are always in no namespace), its namespace URI is null.
The other thing to bear in mind when dealing with PI nodes is that their pseudo-attributes look like, but are not, real attributes (hence, the “pseudo-” prefix). From an XPath processor’s point of view, everything between the PI target and the closing ?> delimiter is a single string of characters. In the case of the PI above, there’s no node type capable of locating the type pseudo-attribute separate from the href pseudo-attribute, for example. (You can, however, use some of the string-manipulation functions covered in Chapter 4 to tease out the discrete pseudo-attributes.)
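The same "one target, one string" view shows up in the DOM. The sketch below uses Python's standard xml.dom.minidom; the empty doc element is a placeholder I added so the PI has a document to live in, and the regular expression is a deliberately simple assumption that handles only double-quoted pseudo-attributes.

```python
import re
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml-stylesheet type="text/css" href="mystyle.css"?><doc/>'
)
pi = doc.childNodes[0]  # the PI precedes the root element

print(pi.target)  # the PI's name, in XPath terms
print(pi.data)    # everything after the target: one undivided string

# Teasing the pseudo-attributes apart is plain string manipulation.
pseudo = dict(re.findall(r'([\w:.-]+)\s*=\s*"([^"]*)"', pi.data))
print(pseudo["href"])
```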
Each comment in an XML source document may be located independently of the surrounding content. A comment has no expanded-name at all, and thus has neither a QName, a local-name, nor a namespace URI.
Any contiguous block of text (an element’s #PCDATA content) constitutes a text node. By “contiguous” here I mean that the text is unbroken by any element, PI, or comment nodes. Consider a fragment of XHTML:
<p>A line of text.<br/>Another line.</p>
The p element here contains not just one but two text nodes, “A line of text.” and “Another line.” The intervening br element breaks them up into two. The presence or absence of whitespace in the #PCDATA content is immaterial. So in the following case:
<p>A line of text. Another line.</p>
there’s still a single text node, which, like a comment, has no expanded-name at all.
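A quick way to check the count of text nodes for yourself is to list the text-node children of the p element. This sketch uses Python's standard xml.dom.minidom; text_children is a hypothetical helper name.

```python
from xml.dom import minidom

def text_children(xml_text):
    """Return the data of each text-node child of the document element."""
    element = minidom.parseString(xml_text).documentElement
    return [n.data for n in element.childNodes if n.nodeType == n.TEXT_NODE]

with_br = text_children("<p>A line of text.<br/>Another line.</p>")
without_br = text_children("<p>A line of text. Another line.</p>")

print(len(with_br))     # text nodes on either side of the br element
print(len(without_br))  # whitespace alone does not split a text node
```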
Namespace nodes are the chimeras and Loch Ness monsters of XPath. They have characteristics of several other node types but at the same time are not “real,” but rather fanciful creatures whose comings and goings are marked with footprints here and there rather than actual sightings.
The XPath spec says every element in a given document has a namespace node corresponding to each namespace declaration in scope for that element:
One for every explicit declaration of a namespace prefix in the start tag of the element itself
One for every explicit declaration of a namespace prefix in the start tag of any containing element
One for the explicit xmlns= declaration, if any, of a default namespace for unprefixed element names, whether this declaration appears in the element’s own start tag or in that of a containing element
Here’s a simple fragment of an XSLT stylesheet:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://www.w3.org/1999/xhtml"> <xsl:template match="/"> <html xmlns:xlink="http://www.w3.org/1999/xlink"> ... </html> </xsl:template> </xsl:stylesheet>
Three explicit namespace declarations are made in this fragment:
The xsl: namespace prefix is associated (in the xsl:stylesheet element’s start tag) with the XSLT namespace URI, “http://www.w3.org/1999/XSL/Transform.”
The default namespace (that is, the namespace for any unprefixed element names appearing within the xsl:stylesheet element) is associated with the XHTML namespace, “http://www.w3.org/1999/xhtml.”
The xlink: namespace prefix is associated (in the html element’s start tag) with the namespace URI set aside by the W3C for XLink elements and attributes, “http://www.w3.org/1999/xlink.”
There’s also one other namespace implicitly in effect for all elements in this, and indeed any, XML document. That is the XML namespace itself, associated with the reserved xml: prefix. The corresponding namespace URI for this implied namespace declaration is “http://www.w3.org/XML/1998/namespace.”
The namespace declarations for the xsl: prefix and the default namespace both appear in the root xsl:stylesheet element’s start tag; therefore, they create implicit namespace nodes on every element in the document, including those for which those declarations might not seem to make much sense. The html element, for instance, will probably not be able to make much use of the namespace node associated with the xsl: prefix. Nonetheless, in formal XPath terms that namespace node is indeed in force for the html element.
The namespace declaration for the xlink: prefix, on the other hand, is made by a lower-level element (html, here). Thus, there is no namespace node corresponding to that prefix for the higher-level xsl:template and xsl:stylesheet elements.
Each namespace node also has a local-name: the associated namespace prefix. So the local-name of the namespace node representing the XSLT namespace in the above document is “xsl.” When associated with the default namespace, the namespace node’s local-name is empty. The namespace URI of a namespace node, somewhat bizarrely, is always null.
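The scoping rules for namespace nodes can be sketched by tracking declarations as a parser enters and leaves elements. The following uses the start-ns and end-ns events of Python's standard ElementTree iterparse; in_scope_prefixes is a hypothetical helper, the xml: binding is added by hand because the parser never reports it, and I'm assuming the XLink namespace URI http://www.w3.org/1999/xlink for the html element's declaration. An empty string stands for the default namespace, matching the empty local-name described above.

```python
import io
import xml.etree.ElementTree as ET

XML_PREFIX = ("xml", "http://www.w3.org/XML/1998/namespace")

FRAGMENT = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml">
  <xsl:template match="/">
    <html xmlns:xlink="http://www.w3.org/1999/xlink"> ... </html>
  </xsl:template>
</xsl:stylesheet>"""

def in_scope_prefixes(xml_bytes):
    """Map each element's local-name to the sorted prefixes in scope for it."""
    declared = []  # stack of (prefix, uri) declarations currently in scope
    found = {}
    events = ("start-ns", "end-ns", "start")
    for event, payload in ET.iterparse(io.BytesIO(xml_bytes), events=events):
        if event == "start-ns":
            declared.append(payload)  # payload is a (prefix, uri) pair
        elif event == "end-ns":
            declared.pop()            # innermost declaration leaves scope
        else:
            local = payload.tag.split("}")[-1]
            found[local] = sorted(dict([XML_PREFIX] + declared))
    return found

for name, prefixes in in_scope_prefixes(FRAGMENT).items():
    print(name, prefixes)
```

Only the html element picks up the xlink prefix; the two higher-level elements see just the xml, xsl, and default bindings.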
In XPath, as in most — maybe all — XML-related subjects, namespaces sometimes seem like more trouble than they’re worth. The basic purpose of namespaces is simple: to disambiguate the names of elements, and perhaps attributes, from more than one XML vocabulary when they appear in a single document instance. Yet the practice of using namespaces leads one down many hall-of-mirrors paths, with concepts and syntax nested inside other concepts and syntaxes and folding back on themselves.
As a practical matter, you will typically have almost no use for identifying or manipulating namespace nodes at all; your documents will consist entirely of elements and attributes from a single namespace.
The XML Information Set (commonly referred to simply as “the Infoset”) is a W3C Recommendation published in October 2001 (http://www.w3.org/TR/xml-infoset/). Its purpose, as stated in the spec’s Abstract, is to provide “a set of definitions for use in other specifications that need to refer to the information in an XML document.”
The definitions the Infoset provides are principally in terms of 11 information items: document, element, attribute, processing-instruction, unexpanded entity reference, character, comment, document type declaration, unparsed entity, notation, and namespace. As you can see, there’s a certain amount of overlap between this list and the node types available under XPath — and also a certain number of loose ends not provided at all by one or the other of the two Recommendations.
XPath 2.0 will resolve the conflicts in definitions of Infoset information items and XPath node types; at the same time, XPath will continue to need things the Infoset does not cover. For instance, XPath does not generally need to refer to atomic-level individual character information items. Instead, it needs to refer to the more “molecular” text nodes. For these “needed by XPath but not defined under the Infoset” information items, XPath 2.0 will continue to provide its own definitions.
For more information about XPath 2.0 and the Infoset, refer to Chapter 5.
It’s hard to imagine a node in an XML document that exists in isolation, devoid of any context. First, as I’ve already mentioned, nodes have relationships with other nodes in the document — both document-order and “family” relationships. Maybe more importantly, but also more subtly, nodes in the node-set returned by a location path also have properties relative to the other nodes in that node-set — even when document order is irrelevant and family relationships, nonexistent. These are the properties of context size, context position, and namespace bindings.
Consider the following XML document:
<ChangeInMyPocket> <Quarters quantity="1"/> <Dimes quantity="1"/> <Nickels quantity="1"/> <Pennies quantity="3"/> <!-- No vending-machine purchase in my immediate future --> </ChangeInMyPocket>
It’s possible, in a single location path, to locate only the four quantity attributes (or any three, two, or one of them) and the comment; or just the root node and the comment; or just the Quarters element, the Pennies element, and the quantity attribute of the Nickels element. The nodes in the resulting node-set need not share any significant formal relationship in the context of the document itself. But in all cases, these nodes suddenly acquire relationships to others in a given node-set, simply by virtue of their membership in that node-set.
The context size is simply the number of nodes in the node-set, irrespective of the type of nodes. A location path that returned a node-set of all element nodes in the above document would have a context size of 5 (one for each of the ChangeInMyPocket, Quarters, Dimes, Nickels, and Pennies elements). A location path returning all the quantity attributes and the comment would also have a context size of 5.
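Counting the members of those two node-sets by hand is easy enough to sketch. This example builds them with Python's standard xml.dom.minidom rather than with real location paths, so it illustrates the arithmetic only; walk is a hypothetical helper.

```python
from xml.dom import minidom

POCKET = """<ChangeInMyPocket>
  <Quarters quantity="1"/>
  <Dimes quantity="1"/>
  <Nickels quantity="1"/>
  <Pennies quantity="3"/>
  <!-- No vending-machine purchase in my immediate future -->
</ChangeInMyPocket>"""

def walk(node):
    """Yield a node and all of its descendants in document order."""
    yield node
    for child in node.childNodes:
        yield from walk(child)

nodes = list(walk(minidom.parseString(POCKET)))
elements = [n for n in nodes if n.nodeType == n.ELEMENT_NODE]
comments = [n for n in nodes if n.nodeType == n.COMMENT_NODE]
attributes = [e.getAttributeNode("quantity")
              for e in elements if e.hasAttribute("quantity")]

print(len(elements))                    # size of the all-elements node-set
print(len(attributes) + len(comments))  # size of the attributes-plus-comment node-set
```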
The context position is different for every member of the node-set: it’s the integer representing the ordinal position that a given node holds in the node-set, relative to all other nodes in it, in document order. If a node-set consists of all child elements of the ChangeInMyPocket element, the Quarters element will have a context position of 1, the Dimes element, 2, and so on. In a different node-set, the Quarters element might be node 2 and Dimes, 1, and so on.
I’ve alluded to this before but just as a reminder: when determining context position, particularly of elements, be aware of “invisible whitespace” separating one element’s end tag from the succeeding one’s start tag. In the above document, a location path that retrieves all children of the ChangeInMyPocket element, not just the child elements, will also locate all the newline and space characters used for “pretty-printing”; each block of these characters constitutes a text-node child of ChangeInMyPocket. Thus, the Quarters element will have a context position of 2, the Dimes, 4, and so on.
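The effect of those whitespace-only text nodes on context position can be sketched by numbering the same elements within two different node-sets. As before, this uses Python's standard xml.dom.minidom to assemble the node-sets by hand, and positions is a hypothetical helper; the document is assumed to be pretty-printed with one child per line.

```python
from xml.dom import minidom

POCKET = """<ChangeInMyPocket>
  <Quarters quantity="1"/>
  <Dimes quantity="1"/>
  <Nickels quantity="1"/>
  <Pennies quantity="3"/>
  <!-- No vending-machine purchase in my immediate future -->
</ChangeInMyPocket>"""

def positions(nodes):
    """Return {element name: 1-based position} for the elements in a node list."""
    return {n.tagName: i
            for i, n in enumerate(nodes, start=1)
            if n.nodeType == n.ELEMENT_NODE}

root = minidom.parseString(POCKET).documentElement
all_children = list(root.childNodes)  # text, element, text, element, ...
element_children = [n for n in all_children if n.nodeType == n.ELEMENT_NODE]

print(positions(element_children))  # child elements only
print(positions(all_children))      # all children, whitespace text included
```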
Chapter 3 and Chapter 4 go into more detail about dealing with context position and context size. Note especially, in Chapter 3, the material about reverse document order in certain XPath axes, because this inverts the normal sequence of context positions.
The term namespace bindings refers to any namespace declarations in effect at the time an XPath expression is evaluated. In the previous document, which has no explicit namespace declarations, the only namespace binding in any expression’s evaluation context will be the “built-in” namespace for elements and attributes whose names are prefixed xml:. Note that any namespace binding is not tied to a particular prefix, however; what’s important is the URI to which the prefix is bound. Consider the following fragment:
<myvocab:root xmlns:myvocab="http://myvocab.com/namespace" xmlns:yourvocab="http://myvocab.com/namespace"> <yourvocab:subelem> [etc.] </yourvocab:subelem> </myvocab:root>
A superficial consideration of the namespace bindings in effect for the above yourvocab:subelem element might suggest that there are two, one for the myvocab: prefix and one for yourvocab:. Not true. There’s only one namespace URI in play at that point (although it’s aliased, after a fashion, by the two prefixes), and hence, there’s only one namespace binding in that element node’s context.
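Here is a small sketch of the "the URI is what counts" point, using Python's standard ElementTree. The any: prefix in the query is an arbitrary name of my own choosing, bound in the query's prefix map rather than in the document, and namespace_uri is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<myvocab:root xmlns:myvocab="http://myvocab.com/namespace"
    xmlns:yourvocab="http://myvocab.com/namespace">
  <yourvocab:subelem>[etc.]</yourvocab:subelem>
</myvocab:root>"""

def namespace_uri(element):
    """Return the namespace URI part of an ElementTree '{uri}local' tag."""
    return element.tag[1:].split("}")[0] if element.tag.startswith("{") else None

root = ET.fromstring(FRAGMENT)
subelem = root[0]

# Two different prefixes in the source, one namespace URI after parsing.
print(namespace_uri(root) == namespace_uri(subelem))

# A query using a third prefix still finds the element, because the match
# is made on the URI the prefix is bound to.
matches = root.findall("any:subelem", {"any": "http://myvocab.com/namespace"})
print(len(matches))
```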
By definition, a well-formed XML document is a text document, incapable of containing such “binary” content as multimedia files and images. Thus, it stands to reason that in navigating XML documents via XPath the strings of text that make up the bulk of the document (aside from the element names themselves) would be of supreme importance. This notion is codified in the concept of string-values. And the importance of string-values lies in the fact that most of the time, when you locate a node in a document via XPath, what you’re after is not the node per se but rather its string-value.
Each node returned by a location path has its own string-value. The string-value of a node depends on the node type, as summarized in Table 2-1. Note that the word “normalized” used to describe the string-value for the attribute node type is the same as elsewhere in the markup world: it means stripped of extraneous whitespace, by trimming leading and trailing whitespace and collapsing multiple consecutive occurrences of whitespace into a single space. For example, given an attribute such as region="  NW    SE" (note leading blank spaces and multiple spaces between the "NW" and "SE"), its normalized value would be "NW SE". Also note, though, that this normalization depends on the attribute’s type, as declared in a DTD or schema. If the attribute type is CDATA, those blank spaces would be assumed to be significant and not normalized. Therefore, if the region attribute is (say) of type NMTOKENS, the leading whitespace is trimmed and the interior whitespace is collapsed; if it’s CDATA, the whitespace remains.
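The dependence on attribute type can be tried out with an internal DTD subset. In this sketch the battle element and the notes attribute are hypothetical names, and the behavior shown is what I'd expect from the expat-based parser that ships with CPython's ElementTree; other parsers, particularly ones that ignore the DTD, may leave both values untouched.

```python
import xml.etree.ElementTree as ET

# Same raw value for both attributes; only the declared types differ.
DOC = """<!DOCTYPE battle [
  <!ATTLIST battle region NMTOKENS #IMPLIED
                   notes  CDATA    #IMPLIED>
]>
<battle region="  NW    SE" notes="  NW    SE"/>"""

battle = ET.fromstring(DOC)

print(repr(battle.get("region")))  # declared NMTOKENS: trimmed and collapsed
print(repr(battle.get("notes")))   # declared CDATA: blank spaces preserved
```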
| Node type | String-value |
| --- | --- |
| Root | Concatenated value of all text nodes in the document |
| Element | Concatenated value of all text nodes within the scope of the element’s start and end tags, including the text nodes contained by any descendant elements |
| Attribute | Normalized value of the attribute |
| PI | Everything in the PI between its target (and whitespace following the target) and the closing ?> delimiter |
| Comment | The comment’s content: the text between the opening <!-- and closing --> delimiters |
| Text | The character data in the node (note that every text node consists of at least one character) |
| Namespace | The namespace URI associated with the corresponding namespace prefix |
If you’re using DOM, note that Table 2-1 establishes a loose correspondence between XPath string-values and the values returned by the DOM nodeValue property. The exceptions (and they’re important ones) are that nodeValue, when applied to the document root and element nodes, returns not a concatenated string but a null value. The only way to get at these node types’ text content through nodeValue is to apply it to their descendant text nodes.
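The gap between nodeValue and the XPath string-value is easy to demonstrate. This sketch uses Python's standard xml.dom.minidom, where a null nodeValue appears as None; string_value is a hypothetical helper that handles only the root and element cases from Table 2-1 by concatenating descendant text nodes.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<source><author>Firesign Theatre</author>'
    '<work year="1970">Don\'t Crush that Dwarf, Hand Me The Pliers</work></source>'
)

def string_value(node):
    """XPath-style string-value of a root or element node (sketch only)."""
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    # Comments and PIs have no children, so they contribute nothing here.
    return "".join(string_value(child) for child in node.childNodes)

source = doc.documentElement
author_text = source.firstChild.firstChild      # text node inside author
year = source.lastChild.getAttributeNode("year")

print(source.nodeValue)       # None: no concatenation for element nodes
print(author_text.nodeValue)  # text nodes do carry a value
print(year.nodeValue)         # so do attribute nodes
print(string_value(source))   # descendant text nodes, concatenated
```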
Consider an XML document such as the following:
<?xml-stylesheet type="text/xsl" href="4or5guys.xsl"?> <quotation xmlns:xlink="http://www.w3.org/1999/xlink"> <source> <author>Firesign Theatre</author> <work year="1970">Don't Crush that Dwarf, Hand Me The Pliers</work> </source> <text>And there's hamburger all over the highway in Mystic, Connecticut.</text> <!-- Following link last verified 2001-09-15 --> <allusion xlink:href="http://www.dern.com/ng_burgr.html"/> </quotation>
All seven XPath node types are present in this document. String-values for some of them are as follows:
Root node: Concatenated values of all text nodes in the document, that is:
Firesign Theatre Don't Crush that Dwarf, Hand Me The Pliers And there's hamburger all over the highway in Mystic, Connecticut.
(Note how the whitespace-only text nodes, included for legibility in the original document, are carried over into the string-value.)
source element node: Concatenated value of all text nodes within the element’s scope (including whitespace-only text nodes):
Firesign Theatre Don't Crush that Dwarf, Hand Me The Pliers
year attribute: 1970
xml-stylesheet PI: type="text/xsl" href="4or5guys.xsl"
Comment: Following link last verified 2001-09-15
Text node (child of the author element): Firesign Theatre
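Namespace nodes: The namespace for the xlink: prefix, declared in the root quotation element, is in scope for every element in the document, so each element has a namespace node for it whose string-value is the namespace URI, http://www.w3.org/1999/xlink. None of the element names use that prefix, however, so the elements themselves are not in a namespace and their own namespace URIs are empty.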
Not only does each node in a node-set have a string-value, as described above; the node-set as a whole has one.
If you followed the logic behind each of the previous examples, especially the concatenation of text nodes that makes up the string-value of a root or element node, you might think the string-value of a node-set containing (say) two or more element nodes is the concatenation of all their string-values. Not so. The string-value of a multinode node-set is the string-value of the first node in that node-set.
(Actually, the apparent inconsistency goes away if you just remember that last sentence, eliminating the word “multinode.” Thus, the value of any single node is just a special case of the general rule; the node-set in this case just happens to be composed of a single node — which is, of course, the first in the node-set.)
In the previous example, the source element has two child element nodes, author and work. This location path:
/quotation/source/*
thus returns a node-set of two nodes. The node-set’s string-value is the string-value of the first node in the node-set, that is, the author element: “Firesign Theatre.”
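The first-node rule can be sketched in a few lines. This uses Python's standard ElementTree, whose findall( ) returns matches in document order; because it is called on the quotation element, the relative path ./source/* stands in for /quotation/source/*. The node_set_string_value helper is hypothetical, and the document is trimmed to the parts the example needs.

```python
import xml.etree.ElementTree as ET

QUOTATION = """<quotation xmlns:xlink="http://www.w3.org/1999/xlink">
  <source>
    <author>Firesign Theatre</author>
    <work year="1970">Don't Crush that Dwarf, Hand Me The Pliers</work>
  </source>
  <text>And there's hamburger all over the highway in Mystic, Connecticut.</text>
</quotation>"""

def node_set_string_value(elements):
    """String-value of a node-set of elements: that of its first node, or '' if empty."""
    if not elements:
        return ""
    return "".join(elements[0].itertext())

quotation = ET.fromstring(QUOTATION)
node_set = quotation.findall("./source/*")

print([e.tag for e in node_set])
print(node_set_string_value(node_set))
```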
[1] Not that you’ll see any further references to something by that name, in the spec or anywhere else.
[2] Depending on the context, such an unquoted token may also be interpreted as a function (covered in Chapter 4), a node test (see Chapter 3), or of course a literal number instead of a string.
[3] Be careful on this issue of escaping the < and > characters. XPath is used in numerous contexts (such as JavaScript and other scripting languages) besides “true XML”; in these contexts, use of a literal, unescaped < or > character may actually be mandated.
[4] Note the importance here of quoting the string “NaN.” If this code fragment had omitted the quotation marks, the XPath processor would not be testing for the special NaN value but for the string-value of an element whose name just happens to be NaN.
[5] XHTML, the “reformulation as XML” of the older HTML standard, is kind of a special case. Because an XHTML document is an XML document, it may use XPath-based XPointers in the value of an href attribute. But you can’t assume that a browser, for now, will conform to the expected behavior of a true XPointer-aware application. Browser vendors don’t exactly leap out of the starting gate to adopt new standards.