Internationalization Considerations

XML, through its support for Unicode, is designed to allow for many natural languages. XQuery provides several functions and mechanisms that support multiple natural languages: collations, the normalize-unicode function, and the lang function.

Collations

Collations are used to specify the order in which characters should be compared and sorted. Characters can be sorted simply based on their code points, but this has a number of limitations. Different languages and locales alphabetize the same set of characters differently. In addition, an uppercase letter and its lowercase equivalent may need to be sorted together. For example, if you sort on code points alone, a lowercase a comes after an uppercase Z.

Collations are not just for sorting. They can be used to equate two strings that contain equivalent values. Some languages and locales may consider two different characters or sequences of characters to be equivalent. For example, a collation may equate the German character ß with the two letters ss. This type of comparison comes into play when using, for example, the contains function, which determines whether one string contains the characters of another string.
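As a minimal sketch of why this matters, the following query calls contains with the code-point collation, which every implementation supports. Under that collation the German sharp s (ß, U+00DF) and the letters ss are simply different strings. A collation that equates the two would return true for the second call, but which collations do so is implementation-specific and is not assumed here.

```xquery
(: The code-point collation compares code points only,
   so "ß" (U+00DF) is not treated as equivalent to "ss". :)
let $codepoint := "http://www.w3.org/2005/xpath-functions/collation/codepoint"
return (
  contains("Straße", "ß", $codepoint),    (: true :)
  contains("Straße", "ss", $codepoint)    (: false :)
)
```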

Collations in XQuery are identified by a URI. The URI serves only as a name and does not necessarily point to a resource on the Web, although it might. All XQuery implementations support at least one collation, whose name is http://www.w3.org/2005/xpath-functions/collation/codepoint. This is a simple collation that compares strings based only on Unicode code points. Although it is based on Unicode code points, it should not be confused with the Unicode collation algorithm, which is a far more sophisticated collation algorithm.
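The following sketch illustrates code-point ordering using the compare function, which returns -1, 0, or 1. The sample strings are arbitrary illustrations. Because every uppercase ASCII letter has a lower code point than every lowercase one, "Zebra" sorts before "apple".

```xquery
let $codepoint := "http://www.w3.org/2005/xpath-functions/collation/codepoint"
return (
  compare("Zebra", "apple", $codepoint),   (: -1, because "Z" is U+005A and "a" is U+0061 :)
  compare("apple", "Zebra", $codepoint),   (: 1 :)
  compare("apple", "apple", $codepoint)    (: 0 :)
)
```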

There are several ways to specify a collation. Some XQuery functions, such as compare and distinct-values, accept a $collation argument that allows you to specify the collation URI. In addition, you can specify a collation in the order by clause of a FLWOR. These expressions accept either an absolute or a relative URI. If a relative URI is provided, it is relative to the base URI of the static context, which is described in Chapter 20.
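A short sketch of both approaches follows, using the code-point collation so that it should run on any implementation. The word list is made up for illustration. Note that the order by clause requires the collation URI to be written as a literal, whereas the $collation argument of a function can be any string expression.

```xquery
let $words := ("pear", "Apple", "pear", "apple")
(: $collation argument: duplicates are detected using the named collation :)
for $w in distinct-values($words,
            "http://www.w3.org/2005/xpath-functions/collation/codepoint")
(: collation on order by: must be a literal URI :)
order by $w collation "http://www.w3.org/2005/xpath-functions/collation/codepoint"
return $w
(: returns "Apple", "apple", "pear" :)
```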

You can also specify a default collation in the query prolog. This default is used by some functions as well as order by clauses when no $collation is specified. The default collation is also used in operations that do not allow you to specify a collation, such as those using the comparison operators =, !=, <, <=, >, and >=. The syntax of a default collation declaration is shown in Figure 17-1.

Figure 17-1. Syntax of a default collation declaration

An example is:

declare default collation "http://datypic.com/collation/custom";

The collation URI must be a literal value in quotes (not an evaluated expression), and it should be a syntactically valid absolute URI.

Alternatively, the implementation may have a built-in default collation, or allow a user to specify one, through means other than the query prolog.

As a last resort, if no $collation argument is provided, no default collation is specified, and the implementation does not provide a default collation, the simple code-point collation named http://www.w3.org/2005/xpath-functions/collation/codepoint is used.

The default collation can be obtained using the default-collation function, which takes no arguments.
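As an illustrative sketch, the query below declares the code-point collation as the default (chosen only because all implementations support it), then retrieves it and relies on it in a comparison operator and in a compare call that omits the $collation argument.

```xquery
declare default collation
  "http://www.w3.org/2005/xpath-functions/collation/codepoint";

(
  default-collation(),          (: the URI declared in the prolog :)
  "Zebra" < "apple",            (: true under the code-point collation :)
  compare("Zebra", "apple")     (: -1; no $collation, so the default is used :)
)
```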

You should consult the documentation of your XQuery implementation to determine which collations are supported. Some collations may expect the strings to be Unicode-normalized already. For these collations, consider using the normalize-unicode function on strings before comparing them. Other collations perform implicit normalization on the strings.

Although it is possible in XML to use an xml:lang attribute to indicate the natural language of character data, use of this attribute has no effect on the collation algorithm used in XQuery. Unlike SQL, the choice of collation depends entirely on the user writing the query, and not on any properties of the data.

Unicode Normalization

Unicode normalization allows text to be compared without regard to subtle variations in character representation. It replaces certain characters with equivalent representations. Two normalized values can then be compared to determine whether they are the same. Unicode normalization is also useful for allowing character strings to be sorted appropriately.

The normalize-unicode function performs Unicode normalization on a string. It takes two arguments: the string to be normalized and, optionally, the normalization form to use. If the second argument is omitted, the form NFC is used. The normalization form controls which characters are replaced. Some characters may be replaced by equivalent characters, while others may be decomposed to an equivalent representation that has two or more code points.
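The sketch below shows the effect on the letter é, which can be written either as the single code point U+00E9 or as the letter e followed by the combining acute accent U+0301. The strings are built with codepoints-to-string so that the example does not depend on how the query file itself is encoded. Only the NFC form is used, since it is the one form every implementation is required to support.

```xquery
let $composed   := codepoints-to-string(233)           (: U+00E9 :)
let $decomposed := codepoints-to-string((101, 769))    (: U+0065 followed by U+0301 :)
return (
  (: the raw strings differ code point by code point :)
  codepoint-equal($composed, $decomposed),                             (: false :)
  (: after normalization to NFC they are identical :)
  codepoint-equal(normalize-unicode($decomposed, "NFC"), $composed),   (: true :)
  string-to-codepoints(normalize-unicode($decomposed, "NFC"))          (: 233 :)
)
```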

Determining the Language of an Element

It is possible to test the language of an element based on the xml:lang attribute of the element itself or, if it has none, of its nearest ancestor that has one. This is accomplished using the lang function.

The lang function accepts as arguments the language to test and, optionally, the node to be tested. The function returns true if the relevant xml:lang attribute of the node (or the context node if no second argument is specified) has a value that matches the argument. The function returns false if the relevant xml:lang attribute does not match the argument, or if there is no relevant xml:lang attribute.
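A small sketch follows, using a made-up desc element constructed inline. The first para has no xml:lang of its own, so it inherits en-US from its parent. The match ignores case and also succeeds on a sublanguage, so "en" matches "en-US".

```xquery
let $desc :=
  <desc xml:lang="en-US">
    <para>Hello</para>
    <para xml:lang="fr">Bonjour</para>
  </desc>
return (
  lang("en", $desc/para[1]),   (: true: inherits en-US, a sublanguage of en :)
  lang("fr", $desc/para[1]),   (: false :)
  lang("fr", $desc/para[2]),   (: true: its own xml:lang is fr :)
  lang("en", $desc/para[2])    (: false: the nearest xml:lang is fr, not en-US :)
)
```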
