Chapter 18. Working with Strings

Strings are probably the most commonly used type of atomic value in queries. This chapter discusses constructing and comparing strings and provides an overview of the many built-in functions that manipulate strings. It also explains string- and text-related features such as whitespace handling and internationalization.

The xs:string Type

The basic string type that is intended to represent generic character data is called, appropriately, xs:string. The xs:string type is not the default type for untyped values. If a value is selected from an input document with no schema, the value is given the type xs:untypedAtomic, not xs:string. However, it is easy enough to cast an untyped value to xs:string. In fact, you can cast a value of any type to xs:string and cast an xs:string value to any type.
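A minimal sketch of these casts, assuming an element built in the query without schema validation (the variable names are only illustrative):

```xquery
xquery version "3.1";
(: Assumption: $n is not schema-validated, so its atomized value is xs:untypedAtomic :)
let $n := <number>557</number>
let $untyped := data($n)
return (
  $untyped instance of xs:untypedAtomic,       (: true :)
  $untyped instance of xs:string,              (: false :)
  xs:string($untyped) instance of xs:string,   (: true :)
  xs:integer(xs:string($untyped)) + 1          (: 558 :)
)
```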

The xs:string type is a primitive type from which a number of other types are derived. All the operations and functions that can be performed on xs:string values can also be performed on values whose types are restrictions of xs:string. This includes user-defined types that appear in a schema, as well as built-in derived types such as xs:token, xs:language, and xs:ID. For a complete explanation of the built-in types, see Appendix B.

Constructing Strings

There are three common ways to construct strings: using string literals, the xs:string constructor, and the string function.

String Literals

Strings can be included in queries as literals, using double or single quotes. For example, ($name = "Priscilla") and string-length('query') are valid expressions that contain string literals. If a literal value is enclosed in quotes, it is automatically assumed to be a string as opposed to a number.

Between quotes, you can escape the surrounding quote character by including it twice. For example, the literal expression "inner ""quotes""!" evaluates to the string inner "quotes"!. This is true for both single and double quotes.

In string literals, you can use single character references that use XML syntax. For example, &#x20; can be used to include a space. You can also use the predefined entity references &lt;, &gt;, &amp;, &quot;, and &apos;. For example, you can specify the string literal "PB&amp;J" to represent the string PB&J. In fact, ampersands must be escaped with &amp; in string literals.
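The escaping rules for string literals can be sketched as follows. The values are arbitrary illustrations, and each comment shows the resulting string:

```xquery
xquery version "3.1";
(
  "inner ""quotes""!",          (: inner "quotes"! :)
  'it''s',                      (: it's :)
  "PB&amp;J",                   (: PB&J :)
  "x &lt; y",                   (: x < y :)
  string-length("a&#x20;b")     (: 3, because the character reference is a single space :)
)
```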

The xs:string Constructor and the string Function

There is a standard constructor for strings named xs:string. The xs:string constructor, like all constructors, accepts either a single atomic value or a single node. If it is an atomic value, it simply returns that value cast as an xs:string.

Some types have special rules about how their values are formatted when they are cast to xs:string. For example, integers have their leading zeros stripped, and xs:hexBinary values have their letters converted to uppercase. In addition, when values of most non-string types are cast to xs:string, their whitespace is collapsed. This means that consecutive whitespace characters are replaced by a single space, and leading and trailing whitespace is removed. The rules (if any) for each type are described in Appendix B.

If the xs:string constructor is passed a node, it uses atomization to extract the typed value of the node, and then casts it to xs:string. For an attribute, this is simply its value. For an element, it is the character data of the element itself and all its descendants, concatenated together in document order.

In addition, there is a built-in function named string that has almost identical behavior. One difference is that if you use the string function with no arguments, it will use the current context item.
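As a rough illustration of the two functions side by side, the following sketch assumes a small in-line product element (hypothetical data modeled on the catalog used elsewhere in this chapter):

```xquery
xquery version "3.1";
let $prod := <product dept="WMN"><number>557</number><name>Fleece Pullover</name></product>
return (
  xs:string($prod/number),               (: 557 :)
  string($prod/@dept),                   (: WMN :)
  string($prod),                         (: 557Fleece Pullover :)
  $prod/name/string(),                   (: Fleece Pullover, using the context item :)
  xs:string(xs:integer("0042")),         (: 42, leading zeros stripped :)
  xs:string(xs:hexBinary("0fb7")),       (: 0FB7, letters converted to uppercase :)
  string(())                             (: zero-length string :)
)
```

One further difference worth knowing: given the empty sequence, string returns a zero-length string, whereas the xs:string constructor returns the empty sequence.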

String Constructors

Starting in version 3.1, there is an additional type of expression, called a string constructor, that allows you to create literal strings with intermingled expressions. This is especially useful for generating strings that are in the syntax of languages such as JSON, HTML or CSS that use curly brackets, angle brackets, quotation marks, or other strings that are delimiters in XQuery 3.1.

String constructors are delimited by ``[ and ]``. Within these delimiters, expressions known as interpolations can appear, delimited by `{ and }`. The rest of the characters are considered literal characters. For example:

let $prod1 := <product dept="WMN">
    <number>557</number>
    <name language="en">Fleece Pullover</name>
    <colorChoices>navy black</colorChoices>
  </product>
return ``[Name: `{$prod1/name}`, Number: `{$prod1/number}`]``

returns the string:

Name: Fleece Pullover, Number: 557

String constructors can be used to generate strings in HTML syntax. For example:

let $prod1 := <product dept="WMN">
    <number>557</number>
    <name language="en">Fleece Pullover</name>
    <colorChoices>navy black</colorChoices>
  </product>
return ``[<h1>`{$prod1/name}`</h1>
<p>Number:&nbsp;`{$prod1/number}`</p>
<h2>Colors</h2>
`{for $color in $prod1/colorChoices/tokenize(.)
  return ``[<li>`{$color}`</li> ]``
}`
]``

returns the following as a string:

<h1>Fleece Pullover</h1>
<p>Number:&nbsp;557</p>
<h2>Colors</h2>
<li>navy</li>
<li>black</li>

In the above example, the string constructor that outputs the color is nested within an interpolation inside another string constructor. Such nesting makes string constructors a powerful tool for templating the syntax of other languages, especially non-XML languages.
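The same technique applies to JSON. The sketch below uses two hypothetical variables, $name and $number; note that curly brackets not preceded by a backtick are treated as literal characters:

```xquery
xquery version "3.1";
let $name := "Fleece Pullover"
let $number := 557
return ``[{"name": "`{$name}`", "number": `{$number}`}]``
(: result: {"name": "Fleece Pullover", "number": 557} :)
```

This sketch does no JSON escaping of the interpolated values, so it is only suitable when the values are known not to contain quotes or backslashes.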

Comparing Strings

Several functions, summarized in Table 18-1, are available for comparing and matching strings.

Table 18-1. Functions that compare strings
Function name      Description
compare            Compares two strings, optionally based on a collation, returning -1, 0, or 1
codepoint-equal    Compares two strings based on codepoints, returning a Boolean value
starts-with        Determines whether a string starts with another string
ends-with          Determines whether a string ends with another string
contains           Determines whether a string contains another string
contains-token     Determines whether a string contains another string surrounded by whitespace
matches            Determines whether a string matches a regular expression

Comparing Entire Strings

Strings can be compared using the comparison operators: =, !=, >, <, >=, and <=. For example, "abc" < "def" evaluates to true.

The comparison operators use the default collation, as described in “Collations”. You can also use the compare function, which fulfills the same role as the comparison operators but allows you to explicitly specify a collation. The compare function accepts two string arguments and returns one of the values -1, 0, or 1, depending on which argument is greater.
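A short sketch of the compare function follows. The first three results assume that the default collation is the Unicode Codepoint Collation; the fourth passes an explicit collation URI that XQuery 3.1 processors are expected to recognize:

```xquery
xquery version "3.1";
(
  compare("abc", "def"),   (: -1 :)
  compare("def", "abc"),   (: 1 :)
  compare("abc", "abc"),   (: 0 :)
  compare("abc", "ABC",
    "http://www.w3.org/2005/xpath-functions/collation/html-ascii-case-insensitive"),  (: 0 :)
  codepoint-equal("abc", "ABC")   (: false :)
)
```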

Determining Whether a String Contains Another String

Four functions test whether a string contains the characters of another string. They are the contains, contains-token, starts-with, and ends-with functions. Each of them returns a Boolean value and takes two strings as arguments: the first is the containing string being tested, and the second is the contained string. (The contains-token function will also accept a sequence of multiple strings as its first argument.) Table 18-2 shows some examples of these functions.

Table 18-2. Examples of contains, contains-token, starts-with, and ends-with
Example                                        Return value
contains("query", "ery")                       true
contains("query", "x")                         false
contains-token("xml query", "query")           true
contains-token( ("xml", "query"), "query")     true
starts-with("query", "que")                    true
starts-with("query", "u")                      false
ends-with("query", "y")                        true
ends-with("query ", "y")                       false

Matching a String to a Pattern

The matches function determines whether a string matches a pattern. It accepts two string arguments: the string being tested and the pattern itself. The pattern is a regular expression, whose syntax is covered in Chapter 19. There is also an optional third argument, which can be used to set additional options in the interpretation of the regular expression, such as multi-line processing and case sensitivity. These options are described in detail in “Using Flags”. Table 18-3 shows examples of the matches function.

Table 18-3. Examples of the matches function
Example                          Return value
matches("query", "q")            true
matches("query", "qu")           true
matches("query", "xyz")          false
matches("query", "q.*")          true
matches("query", "[a-z]{5}")     true
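Because matches returns true when any part of the string matches the pattern, anchors are needed to test the start, the end, or the whole string. The following sketch also shows the optional flags argument, using the "i" flag for case-insensitive matching:

```xquery
xquery version "3.1";
(
  matches("query", "^qu"),          (: true :)
  matches("query", "^ery"),         (: false :)
  matches("query", "^[a-z]{5}$"),   (: true :)
  matches("Query", "q"),            (: false, matching is case-sensitive by default :)
  matches("Query", "q", "i")        (: true :)
)
```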

Substrings

Three functions are available to return part of a string. The substring function returns a substring based on a starting position (starting at 1, not 0) and optionally a length. For example:

substring("query", 2, 3)

returns the string uer. If no length is specified, the function returns the rest of the string. For example:

substring("query", 2)

returns uery.

The substring-before function returns all the characters of a string that occur before the first occurrence of another specified string. The substring-after function returns all the characters of a string that occur after the first occurrence of another specified string. Table 18-4 shows examples of the substring functions.

Table 18-4. Examples of the substring functions
Example                                  Return value
substring("query", 2, 3)                 uer
substring("query", 2)                    uery
substring-before("query", "er")          qu
substring-before("queryquery", "er")     qu
substring-after("query", "er")           y
substring-after("queryquery", "er")      yquery

Finding the Length of a String

The length of a string can be determined using the string-length function. It accepts a single string and returns its length as an integer. Whitespace is significant, so leading and trailing whitespace characters are counted. Table 18-5 shows some examples.

Table 18-5. Examples of the string-length function
Example                                        Return value
string-length("query")                         5
string-length(" query ")                       7
string-length(normalize-space(" query "))      5
string-length("")                              0
string-length("&#x20;")                        1

Concatenating and Splitting Strings

Six functions, summarized in Table 18-6, concatenate and split apart strings.

Table 18-6. Functions that concatenate and split apart strings
Name                    Description
concat                  Concatenates two or more strings
string-join             Concatenates a sequence of strings, optionally using a separator
tokenize                Breaks a single string into a sequence of strings, using a specified separator
analyze-string          Splits a string based on parts that match and don’t match a pattern
codepoints-to-string    Converts a sequence of Unicode codepoint values to a string
string-to-codepoints    Converts a string to a sequence of Unicode codepoint values

Concatenating Strings

Strings can be concatenated together using one of two functions: concat or string-join. The concat function accepts individual string arguments and concatenates them together. This function is unique in that it accepts a variable number of arguments. For example:

concat("a", "b", "c")

returns the string abc. The string-join function, on the other hand, accepts a sequence of strings. For example:

string-join( ("a", "b", "c"))

also returns the string abc. In addition, string-join allows a separator to be passed as the second argument. For example:

string-join( ("a", "b", "c"), "/")

returns the string a/b/c.

Starting in version 3.0, there is also a string concatenation operator, the double vertical bar (||). This has the same effect as the concat function but is slightly more convenient syntactically. For example:

"a" || "b" || "c"

returns the string abc. As with the concat function, the operands can be single nodes or atomic values of any type. They are atomized (if necessary) and cast to xs:string before concatenation. A single operand of this operator cannot, however, be a sequence of multiple values. For that, the string-join function is still the best option.
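The following sketch shows the operator applied to nodes and to the empty sequence, which is treated as a zero-length string. The product element is hypothetical data:

```xquery
xquery version "3.1";
let $prod := <product><number>557</number><name>Fleece Pullover</name></product>
return (
  $prod/name || " / " || $prod/number,   (: Fleece Pullover / 557 :)
  "a" || () || "b",                      (: ab :)
  string-join($prod/*, ", ")             (: 557, Fleece Pullover :)
)
```

The last expression uses string-join because $prod/* is a sequence of two nodes, which would not be allowed as a single operand of ||.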

Splitting Strings Apart

Strings can be split apart, or tokenized, using the tokenize function. This function breaks a string into a sequence of strings, using a regular expression to designate the separator character(s). For example:

tokenize("a/b/c", "/")

returns a sequence of three strings: a, b, and c. Regular expressions such as \s, which represents a whitespace character (space, line feed, carriage return, or tab), and \W, which represents a non-word character (anything other than a letter or digit), are often used with this function. A list of useful regular expressions for tokenization can be found in Appendix A, in the “tokenize” section. Table 18-7 shows some examples of the tokenize function.

Table 18-7. Examples of the tokenize function
Example                                        Return value
tokenize("a b c", "\s")                        ("a", "b", "c")
tokenize("a b c", "\s+")                       ("a", "b", "c")
tokenize("a-b--c", "-")                        ("a", "b", "", "c")
tokenize("-a-b-", "-")                         ("", "a", "b", "")
tokenize("a/ b/ c", "[/\s]+")                  ("a", "b", "c")
tokenize("2015-12-25T12:15:00", "[-T:]")       ("2015", "12", "25", "12", "15", "00")
tokenize("Hello, there.", "\W+")               ("Hello", "there", "")

The analyze-string function can also be used to split apart strings. This function is especially useful if you want to keep both the matching and non-matching parts of the string (as opposed to tokenize, which throws away the delimiters).

In order to provide a structured result that contains both matches and non-matches, the analyze-string function returns an XML element named fn:analyze-string-result that contains elements called fn:match for each part of a string that matches the regular expression, and fn:non-match for each part that does not match.

For example, the following:

analyze-string("can be reached at 231-555-1212 or",
  "\d{3}-\d{3}-\d{4}")

will return the following XML, which could then be traversed to, for example, tag the phone number but keep the surrounding text:

<fn:analyze-string-result xmlns:fn="http://www.w3.org/2005/xpath-functions">
  <fn:non-match>can be reached at </fn:non-match>
  <fn:match>231-555-1212</fn:match>
  <fn:non-match> or</fn:non-match>
</fn:analyze-string-result>
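One possible way to traverse such a result is sketched below. The element names p and phone are hypothetical, chosen only for illustration:

```xquery
xquery version "3.1";
let $text := "can be reached at 231-555-1212 or"
let $result := analyze-string($text, "\d{3}-\d{3}-\d{4}")
return
  <p>{
    (: Visit fn:match and fn:non-match children in document order :)
    for $part in $result/*
    return
      if ($part instance of element(fn:match))
      then <phone>{string($part)}</phone>   (: tag the matching part :)
      else string($part)                    (: keep surrounding text as-is :)
  }</p>
(: result: <p>can be reached at <phone>231-555-1212</phone> or</p> :)
```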

Converting Between Codepoints and Strings

Strings can be constructed from a sequence of Unicode codepoint values (expressed as integers) using the codepoints-to-string function. For example:

codepoints-to-string( (97, 98, 99) )

returns the string abc. The string-to-codepoints function performs the opposite; it converts a string to a sequence of codepoints. For example:

string-to-codepoints("abc")

returns a sequence of three integers: 97, 98, and 99.
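Used together, the two functions allow character-level manipulation. The following sketch reverses a string and splits a string into single-character strings:

```xquery
xquery version "3.1";
(
  codepoints-to-string(reverse(string-to-codepoints("query"))),   (: yreuq :)
  string-to-codepoints("query") ! codepoints-to-string(.),        (: ("q", "u", "e", "r", "y") :)
  codepoints-to-string((88, 81))                                  (: XQ :)
)
```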

Manipulating Strings

Four functions can be used to manipulate the characters of a string. They are listed in Table 18-8.

Table 18-8. Functions that manipulate strings
Function name    Description
upper-case       Translates a string into uppercase equivalents
lower-case       Translates a string into lowercase equivalents
translate        Replaces individual characters with other individual characters
replace          Replaces characters that match a regular expression with a specified string

Converting Between Uppercase and Lowercase

The upper-case and lower-case functions are used to convert a string to all uppercase or lowercase. For example, upper-case("Query") returns QUERY. The mappings between lowercase and uppercase characters are determined by Unicode case mappings. If a character does not have a corresponding uppercase or lowercase character, it is included in the result string unchanged. Table 18-9 shows some examples.

Table 18-9. Examples of the upper-case and lower-case functions
Example                      Return value
upper-case("query")          QUERY
upper-case("Query")          QUERY
lower-case("QUERY-123")      query-123
lower-case("Query")          query

Replacing Individual Characters in Strings

The translate function is used to replace individual characters in a string with other individual characters. It takes three arguments:

  • The string to be translated

  • The list of characters to be replaced (as a string)

  • The list of replacement characters (as a string)

Each character in the second argument is replaced by the character in the same position in the third argument. For example:

translate("**test**321", "*123", "-abc")

returns the string --test--cba. If the second argument is longer than the third argument, the extra characters in the second argument are simply omitted from the result. For example:

translate("**test**321", "*123", "-")

returns the string --test--.

Replacing Substrings That Match a Pattern

The replace function is used to replace non-overlapping substrings that match a regular expression with a specified replacement string. It takes three arguments:

  • The string to be manipulated

  • The pattern, which uses the regular expression syntax described in Chapter 19

  • The replacement string

While it is nice to have the power of regular expressions, you don’t have to be familiar with regular expressions to replace a particular sequence of characters; you can simply specify the string you want replaced for the $pattern argument, as long as it doesn’t contain any special characters.

An optional fourth argument allows for additional options in the interpretation of the regular expression, such as multi-line processing and case sensitivity. Table 18-10 shows some examples.

Table 18-10. Examples of the replace function
Example                              Return value
replace("query", "r", "as")          queasy
replace("query", "qu", "quack")      quackery
replace("query", "[ry]", "l")        quell
replace("query", "[ry]+", "l")       quel
replace("query", "z", "a")           query
replace("query", "query", "")        A zero-length string
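The caveat about special characters matters in practice. The sketch below shows what happens when the pattern contains a period, along with two uses of the optional flags argument (the "q" flag, which treats the pattern as a literal string, requires version 3.0 or later):

```xquery
xquery version "3.1";
(
  replace("a.b.c", ".", "-"),           (: -----, because . matches any character :)
  replace("a.b.c", "\.", "-"),          (: a-b-c :)
  replace("a.b.c", ".", "-", "q"),      (: a-b-c :)
  replace("Query query", "q", "k", "i") (: kuery kuery :)
)
```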

XQuery also supports variables in the replacement text, which allow parenthesized sub-expressions to be referenced by number. You can use the variables $1 through $9 to represent the first nine parenthesized expressions in the pattern. This is very useful when replacing strings, on the condition that they come directly before or after another string. For example, if you want to change instances of the word Chap to the word Sec, but only those that are followed by a space and a digit, you can use the function call:

replace("Chap 2...Chap 3...Chap 4...", "Chap (\d)", "Sec $1.0")

which returns Sec 2.0...Sec 3.0...Sec 4.0.... Sub-expressions are discussed in more detail in “Using Sub-Expressions with Replacement Variables”.
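A few more sketches of replacement variables follow, using arbitrary sample strings. The last one shows that a literal dollar sign in the replacement text must be escaped with a backslash:

```xquery
xquery version "3.1";
(
  replace("2015-12-25", "(\d{4})-(\d{2})-(\d{2})", "$2/$3/$1"),   (: 12/25/2015 :)
  replace("Walmsley, Priscilla", "(\w+), (\w+)", "$2 $1"),        (: Priscilla Walmsley :)
  replace("price: 5", "(\d)", "\$$1")                             (: price: $5 :)
)
```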

Whitespace and Strings

Whitespace handling varies by implementation and depends on whether the implementation uses schema validation, and how it chooses to handle whitespace in element content. Every XML parser normalizes the whitespace in attribute values, replacing carriage returns, line feeds, and tabs with spaces. XML Schema processors may further normalize whitespace of an attribute or element value based on its type. During XML Schema validation, whitespace is preserved in values of type xs:string (and some of its derived types), but collapsed in all others.

Within string literals in queries, whitespace is always significant. For example, the expression string-length(" x ") evaluates to 3, not 1.

Normalizing Whitespace

The normalize-space function collapses whitespace in a string. Specifically, it performs the following steps:

  1. Replaces each carriage return (#xD), line feed (#xA), and tab (#x9) character with a single space (#x20)

  2. Collapses all consecutive spaces into a single space

  3. Removes all leading and trailing spaces

Table 18-11 shows some examples.

Table 18-11. Examples of the normalize-space function
Example                              Return value
normalize-space("query")             query
normalize-space(" query ")           query
normalize-space("xml query")         xml query
normalize-space(" xml  query ")      xml query
normalize-space(" ")                 A zero-length string
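A common use of normalize-space is to compare element content without regard to incidental whitespace. The following sketch assumes a hypothetical name element with irregular spacing:

```xquery
xquery version "3.1";
let $name := <name>  Fleece   Pullover </name>
return (
  normalize-space($name),                      (: Fleece Pullover :)
  $name/normalize-space(),                     (: Fleece Pullover, using the context item :)
  normalize-space("a&#x9;b&#xA;&#xA;c"),       (: a b c :)
  $name = "Fleece Pullover",                   (: false :)
  normalize-space($name) = "Fleece Pullover"   (: true :)
)
```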

Internationalization Considerations

XML, through its support for Unicode, is designed to allow for many natural languages. XQuery provides several functions and mechanisms that support multiple natural languages: collations, the normalize-unicode function, and the lang function.

Collations

Collations are used to specify the order in which characters should be compared and sorted. Characters can be sorted simply based on their codepoints, but this has some limitations. Different languages and locales alphabetize the same set of characters differently. In addition, an uppercase letter and its lowercase equivalent may need to be sorted together. For example, if you sort on codepoints alone, a lowercase a comes after an uppercase Z.

Collations are not just for sorting. They can be used to equate two strings that contain equivalent values. Some languages and locales may consider two different characters or sequences of characters to be equivalent. For example, a collation may equate the German character ß with the two letters ss. This type of comparison comes into play when using, for example, the contains function, which determines whether one string contains the characters of another string.

Collations in XQuery are identified by URIs. The URI serves only as a name and does not necessarily point to a resource on the Web, although it might.

Supported collations

All XQuery implementations recognize at least three collation URIs:

http://www.w3.org/2005/xpath-functions/collation/codepoint

The Unicode Codepoint Collation that simply compares strings based only on Unicode codepoints.

http://www.w3.org/2005/xpath-functions/collation/html-ascii-case-insensitive

The HTML ASCII case-insensitive collation that compares strings in a case-insensitive manner, where the uppercase letters A to Z are considered equivalent to the lowercase letters a to z. Other characters are compared based on their Unicode code points. This is defined by HTML and is used, for example, to compare HTML class attributes.

http://www.w3.org/2013/collation/UCA

The Unicode Collation Algorithm is a far more sophisticated collation. The collation URI can be followed by a number of query parameters, for example:

http://www.w3.org/2013/collation/UCA?lang=sv;numeric=yes

indicates that the language is Swedish and that sequences of consecutive digits should be sorted as numbers. While processors are required to recognize this collation URI, they are by default not required to fully support it and can use a different collation as a fallback.

Some implementations support additional collations. You should consult the documentation of your XQuery implementation to determine which collations are supported, including more details about what query parameters are supported for the Unicode Collation Algorithm and what fallback collations are used. If an unsupported collation URI is specified in a query, an error is raised.

Specifying a collation

There are several ways to specify a collation. Some XQuery functions, such as compare and distinct-values, accept a $collation argument that allows you to specify the collation URI. For example:

distinct-values(doc("catalog.xml")//@dept, "http://datypic.com/collation/custom")

In addition, you can specify a collation in the order by and group by clauses of a FLWOR. For example:

order by $d collation "http://datypic.com/collation/custom"

or:

group by $d collation "http://datypic.com/collation/custom"

You can also specify a default collation in the query prolog. This default is used by some functions as well as order by and group by clauses when no collation keyword is specified. The default collation is also used in operations that do not allow you to specify collation, such as those using the comparison operators =, !=, <, <=, >, and >=. The syntax of a default collation declaration is shown in Figure 18-1.

Figure 18-1. Syntax of a default collation declaration

An example is:

declare default collation "http://datypic.com/collation/custom";

Regardless of how a collation URI is specified, it must be a literal value in quotes (not an evaluated expression), and it should be a syntactically valid URI. If a relative URI is provided, it is relative to the static base URI, which is described in “Static base URI”.

Alternatively, the implementation may have a built-in default collation, or allow a user to specify one, through means other than the query prolog. The default collation can be obtained using the default-collation function, which takes no arguments.

As a last resort, if no $collation argument is provided, no default collation is specified, and the implementation does not provide a default collation, then the simple Unicode Codepoint Collation is used.
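The following sketch pulls these mechanisms together. It assumes a version 3.1 processor that supports the HTML ASCII case-insensitive collation described earlier, and it declares that collation as the default in the prolog:

```xquery
xquery version "3.1";
declare default collation
  "http://www.w3.org/2005/xpath-functions/collation/html-ascii-case-insensitive";
(
  "Query" = "QUERY",                          (: true under the declared default :)
  count(distinct-values(("a", "A", "b"))),    (: 2, because a and A are considered equal :)
  default-collation(),                        (: the URI declared above :)
  compare("Query", "QUERY",
    "http://www.w3.org/2005/xpath-functions/collation/codepoint")   (: 1 :)
)
```

The last expression overrides the default by passing the Unicode Codepoint Collation explicitly, under which lowercase u sorts after uppercase U.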

Although it is possible in XML to use an xml:lang attribute to indicate the natural language of character data, use of this attribute has no effect on the collation algorithm used in XQuery. Unlike SQL, the choice of collation depends entirely on the user writing the query, and not on any properties of the data.

Unicode Normalization

Unicode normalization allows text to be compared without regard to subtle variations in character representation. It replaces certain characters with equivalent representations. Two normalized values can then be compared to determine whether they are the same. Unicode normalization is also useful for allowing character strings to be sorted appropriately.

The normalize-unicode function performs Unicode normalization on a string. It takes two arguments: the string to be normalized and the normalization form to use. The normalization form controls which characters are replaced. Some characters may be replaced by equivalent characters, while others may be decomposed to an equivalent representation that has two or more codepoints.
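As an illustration, the letter é can be represented either as one precomposed codepoint or as the letter e followed by a combining acute accent. The sketch below assumes that the default collation is the Unicode Codepoint Collation and that the processor supports the NFD form in addition to NFC:

```xquery
xquery version "3.1";
let $composed   := codepoints-to-string(233)          (: U+00E9, precomposed :)
let $decomposed := codepoints-to-string((101, 769))   (: e followed by U+0301 :)
return (
  $composed = $decomposed,                                     (: false :)
  normalize-unicode($decomposed, "NFC") = $composed,           (: true :)
  string-length($decomposed),                                  (: 2 :)
  string-to-codepoints(normalize-unicode($composed, "NFD"))    (: 101, 769 :)
)
```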

Determining the Language of an Element

It is possible to test the language of an element based on the existence of an xml:lang attribute among its ancestors. This is accomplished using the lang function.

The lang function accepts as arguments the language for which to test and, optionally, the node to be tested. The function returns true if the relevant xml:lang attribute of the node (or the context node if no second argument is specified) has a value that matches the argument. The function returns false if the relevant xml:lang attribute does not match the argument, or if there is no relevant xml:lang attribute.
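A brief sketch of the lang function follows, using a hypothetical desc element. The match is case-insensitive and also succeeds when the xml:lang value is a sublanguage of the one being tested:

```xquery
xquery version "3.1";
let $desc := <desc xml:lang="en-US"><line>A great shirt!</line></desc>
return (
  lang("en", $desc/line),     (: true, en-US is inherited from the parent element :)
  lang("en-us", $desc),       (: true :)
  lang("fr", $desc/line),     (: false :)
  lang("en", <line/>)         (: false, no xml:lang attribute applies :)
)
```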
