Chapter 17

Internationalization: Putting the “W” in “WWW”

17.1 Introduction

The organization that developed the XML standard,1 the namespaces standard, XML Schema, XSLT, and XQuery is named “World Wide Web Consortium.” A great deal of emphasis has been, and continues to be, given to the first words: World Wide. The Director of the W3C, Tim Berners-Lee, is adamant, as is the entire staff, that the scope of the web must be world wide. As a result, the policies and practices of the W3C demand that the Recommendations developed by its Working Groups be written in a way that serves the entire world, not merely Americans and Western Europeans.

One of the most important consequences of those policies and practices is that the specifications written by the various Working Groups have to carefully consider implications of many cultures, some of which might not even be recognized when the specs are under development. In this chapter, we’ll explore some of the ways in which the work of the W3C is affected by the desire to serve the entire world, particularly as it applies to the subject of this book: querying XML.

17.2 What Is Internationalization?

The process of ensuring that a specification or a product can be used by any culture, using any script (writing system) or language, is commonly called internationalization, often abbreviated as “I18n” (that is, the letter i followed by 18 letters, followed by the letter n). (By contrast, the process of customizing a specification or product to be optimal in a specific culture, set of cultures, geographic regions, languages, and so forth is known as localization, or “L10n.”)

The concept of internationalization is sometimes difficult for “westerners” (by which we mean members of cultures established primarily by people whose family origins lie in western Europe) to fully grasp. In fact, the computer industry has struggled with the many components of that concept for many years. A number of different factors are involved in any culture’s world view, and most of those are necessarily involved in making computer systems, communications systems, etc. Here’s a list of some of the more obvious items (and some not-so-obvious ones):

• The language(s) spoken.

• The script(s), or writing system(s), used to represent the language(s), including the character set(s) and writing direction(s).

• The rules for comparing and ordering sequences of characters.

• The conventions for “spelling” written forms of the language(s).

• The notation(s) used to write dates and times.

• The time zone(s) in use and the rules for adjusting them by season.

• The notations used to write numbers (e.g., decimal marker, thousands separator).

• The conventions for writing words, sentences, and paragraphs (including, for example, whether white space is used to separate words and whether the first lines of paragraphs are indented).

• The ways in which currency (“money”) is represented in writing.

• Units used for measures (e.g., metric vs. “Imperial”).

Obviously, that list is far from complete, but it should be sufficient to give you a sense of the scope of the problems involved in making computer systems and languages equally accessible to all cultures. Over the decades, a great deal of work has been done in pursuit of internationalization of hardware and software products, of specifications and standards, of applications, and of operating systems. Different (computer industry) communities have taken many different approaches, sometimes leveraging concurrent or preceding work and sometimes conflicting with it.

To make matters worse, software in particular is often developed under severe time and scope pressures, leaving developers insufficient resources to think about and address aspects for which they have relatively little appreciation. For example, the average American software engineer in the 1980s had little knowledge of languages and scripts used in the Far East (such as China, Japan, Korea, and Vietnam)2 or the problems with representing text in languages where vowels and consonants are treated differently (such as Semitic, Sanskrit, and Dravidian languages). Consequently, systems developed at that time tended to have a strong English, or at least Latin script-based, orientation.

Even today, a significant amount of discipline is required by developers to keep in mind the needs of unfamiliar cultures. A particular problem is that commercial, proprietary software tends to be written to support only those cultures that are sufficiently important economically – that is, where the software will return enough revenue to pay for the process of making the software useful in those cultures.3

17.3 Internationalization and the World Wide Web

The W3C has established a Working Group, appropriately called the Internationalization WG, to monitor the work of the W3C – especially the development of Recommendation-track specifications – to ensure that the work truly addresses the needs of the world. It’s instructive to read the mission statement4 of the I18n WG:

To enable universal access to the World Wide Web by proposing and coordinating the adoption by the W3C of techniques, conventions, technologies, and designs that enable and enhance the use of W3C technology and the Web worldwide, with and between the various different languages, scripts, regions, and cultures.

Of course, concepts such as “universal access” can be interpreted more or less broadly. Clearly, neither the W3C nor its I18n WG has any significant influence on the availability of computer systems, networking facilities, or educational systems throughout the world. A more reasonable interpretation of that mission statement is easy to attain after reading a bit further: “by proposing … techniques, conventions, technologies, and designs” that help those different regions and cultures to use the web.

Not surprisingly, the W3C is not the only – perhaps not even the principal – organization with a strong commitment to internationalization. For example, the Unicode Consortium5 (about which you’ll read more in Section 17.3.1) is devoted entirely to creation of specifications that encourage the production of software that supports cultural conventions in the areas of writing. ISO, the International Organization for Standardization, does not have explicit policies that require careful internationalization of its standards, but publishes many standards in support of internationalization as well as many other standards that include internationalization components.

In this section, we discuss the two most important (in our opinions) specifications that drive internationalization today. The first of these (Unicode) defines a character set intended to cover every writing system and culture of any importance at any time in human history, as well as a number of other factors that a computer system must address in order to properly support those writing systems and cultures. The other (the W3C Character Model for the World Wide Web) defines a model for the transmission and manipulation of character data on the web.

17.3.1 Unicode

Unicode is a character set defined by The Unicode Consortium and published as The Unicode Standard.6 The goal of Unicode is “to remedy two serious problems common to most multilingual computer programs. The first problem was the overloading of the font mechanism7 when encoding characters … [the] second major problem was the use of multiple, inconsistent character codes because of conflicting national and industry character standards.”

Unicode is, for all practical purposes, identical to an international standard known as the Universal Character Set (or UCS).8 UCS (which, interestingly enough, is more popularly called by its number: 10646) evolved from a simple, elegant model in which a complete 32-bit space was dedicated to character encoding, allowing over 4 billion characters – slight overkill, in the estimation of many observers!

The space was partitioned into 256 “groups” of 256 “planes” each; each plane provided 65,536 (that is, 256 × 256) positions in which characters could be encoded. Group 0, plane 0 was designated to be the multilingual plane, in which the characters most widely used in the most widely-used languages would be encoded, while the remaining 255 planes in group 0 would be used for encoding less common characters. Limiting the character encoding space to group 0 provided 16,777,216 positions in which characters could be encoded – even that number was thought to be a bit much.

Concurrently with the initial development of UCS, the Unicode Consortium was established under the premise (some thought it a dubious premise) that only the multilingual plane was required for practical use in information technology applications and that the advantages to programmers of having “fixed-width” character encodings whose number of bytes was a power of two (e.g., 2¹ = 2 or 2² = 4) were considerable.

Over time, both standards development groups realized that their basic premises were flawed. The Unicode Consortium recognized that the world’s current languages (not to mention historical languages and scripts) could not be satisfied with only 65,536 characters, while the ISO committees realized that their (arguably more pure) 32-bit model was unnecessarily large. The groups decided to form a coalition in which supporters of the de jure process (such as national governments) would be presented with an international standard whose development was guided by an organization with full-time staff dedicated to continued evolution of the underlying character set and associated concepts.

The coalition of the two groups incorporated a formal agreement that the character set standard being published by each of the groups would be identical (at least to the degree that they contained exactly the same set of characters that are encoded at identical positions) and that the groups would work together to ensure that the standards remained identical. There are, however, slight differences between the two standards; for example, ISO/IEC 10646 defines an encoding form (UCS2) that is not part of The Unicode Standard.

In the end, the Unicode Consortium allocated a total of 16 planes beyond the multilingual plane for encoding characters. That space provides a total of 17 × 65,536 (that is, 1,114,112) positions. Of those positions, a small number have been set aside as “not a character.”

Figure 17-1 illustrates the original ISO/IEC 10646 character encoding space, a variation reduced to a single group of 256 planes, and the final Unicode encoding space (adopted by the definers of ISO/IEC 10646) of 17 planes.

Figure 17-1 UCS and Unicode character encoding space.

Unicode (and, by extension, UCS) undertook to create a repertoire of characters for every known script and language. Of course, it can never be proven that the work is complete, because new scripts and even new languages are discovered every now and then. While it might seem straightforward to create such a repertoire, reserving lots of space for characters yet to be discovered, there are complications.

One of these is the fact that a great many scripts depend on “decorations” being associated with certain characters. For example, Semitic languages are written primarily with characters representing consonants while the vowels are usually omitted, but represented as decorations printed “near” the consonants with which they are associated. Similarly, languages such as French and German depend on marks, often called “accents,” that (when displayed or printed) appear above the characters they modify. As one can easily imagine, the number of possible combinations of characters and related marks is very large indeed. Not all of those combinations are in actual use, but languages such as Vietnamese depend on the ability to place several such marks on individual characters.

The “obvious” solution to this issue was to encode “base characters” and their decorations separately. Therefore, the German character u-umlaut (ü) would be encoded as two separate characters, the letter “u” and a “nonspacing” umlaut. While that might be considered an elegant solution, it failed to account for the fact that most Westerners were accustomed to encoding the common characters of their languages in just one code position. For example, a very common character encoding standard used in Germany, known as Latin 1,9 includes u-umlaut as a single character.

Consequently, a guiding principle of Unicode was to incorporate existing widely-used character coding standards into Unicode as completely as possible. Thus, Latin 1 appears in Unicode as a contiguous sequence of characters in the same sequence that they appeared in the Latin 1 standard. Characters that are a base character plus an accent or other decoration, such as “ü,” encoded at a single position, are known as composed (sometimes called “precomposed”) characters. Unicode also permits such characters to be encoded as two separate characters (the “u” and the umlaut); when they are encoded in that manner, they are called decomposed characters.

The Unicode Standard defines several forms of normalization, including canonical composition, in which every character that can be composed into a single codepoint is represented that way, and canonical decomposition, in which all characters are decomposed into their base character and separate decorations.
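These normal forms can be observed directly in any Unicode-aware programming language. Here is a minimal sketch in Python, whose standard unicodedata module implements the Unicode normalization algorithms:

```python
import unicodedata

composed = "\u00fc"     # "ü" as one precomposed character (U+00FC)
decomposed = "u\u0308"  # "u" followed by a combining diaeresis (U+0308)

# The two sequences render identically but contain different codepoints
assert composed != decomposed

# Canonical composition (NFC) combines them into a single codepoint;
# canonical decomposition (NFD) splits them apart again
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```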

In addition to normalization, the Unicode Standard defines several ways to encode Unicode into a sequence of bytes. Two of these are in such wide use that we briefly describe them here. We must emphasize this: Unicode is a character set, the formats we describe next are still Unicode – they’re just ways to represent Unicode in byte sequences.

The first, called UTF-8 (Unicode Transformation Format, 8-bit form), is a variable-width encoding, meaning that the encoding of different Unicode characters might require a different number of octets. UTF-8 was designed so that all of the original ASCII10 characters are encoded in one octet (byte). However, characters other than 7-bit ASCII characters require more than one octet – up to four octets for some characters used in Japanese, Chinese, and Korean scripts.
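The variable-width behavior is easy to observe. This Python sketch encodes characters from successively higher regions of the Unicode space and checks the number of octets each occupies in UTF-8:

```python
samples = [
    "A",           # U+0041, 7-bit ASCII: 1 octet
    "\u00fc",      # U+00FC, Latin-1 range: 2 octets
    "\u4e2d",      # U+4E2D, a CJK ideograph on the multilingual plane: 3 octets
    "\U0002070e",  # U+2070E, a supplementary-plane ideograph: 4 octets
]
widths = [len(ch.encode("utf-8")) for ch in samples]
assert widths == [1, 2, 3, 4]
```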

The other widely used encoding form is called UTF-16 (Unicode Transformation Format for 16 Planes of Group 00). In UTF-16, all characters that are encoded on the multilingual plane are represented in exactly 16 bits, or two octets. In addition, a total of 2048 positions of the multilingual plane are reserved for surrogates. Surrogates, which are valid only in UTF-16, are used to represent the 1024 × 1024 (1,048,576) character positions that are not part of the multilingual plane – in other words, the positions on the additional 16 planes. Those characters are represented in four octets: two octets that identify one surrogate value and two that identify a second surrogate value. Surrogate code positions are not themselves characters.
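The surrogate mechanism can likewise be demonstrated in Python, whose utf-16-be codec emits the surrogate pair for a supplementary-plane character:

```python
# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the multilingual plane
data = "\U0001d11e".encode("utf-16-be")
assert len(data) == 4  # two 16-bit surrogates, four octets in all

high = int.from_bytes(data[:2], "big")
low = int.from_bytes(data[2:], "big")
# The first value falls in the high-surrogate range, the second in the low
assert 0xD800 <= high <= 0xDBFF
assert 0xDC00 <= low <= 0xDFFF

# A multilingual-plane character, by contrast, occupies exactly two octets
assert len("\u4e2d".encode("utf-16-be")) == 2
```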

It is not always obvious which of these encoding formats, UTF-8 or UTF-16, should be used for any particular situation. There’s a definite trade-off in terms of the amount of space required to represent any given character string.

If your data uses only ASCII characters, then UTF-8 is most certainly the right choice, since every character is encoded in a single byte. However, if your data contains characters other than those in ASCII, UTF-8 begins to require more space: A great many nonideographic characters are represented in UTF-8 by two bytes, all characters of the multilingual plane are represented in no more than three bytes, and all characters from the other 16 planes are represented in four bytes. If your data contains only ASCII characters and those that can be represented in two bytes, then the average number of bytes required for any given text string will be somewhere between one and two bytes, inclusive. When you start using the less-common characters, especially those encoded outside the multilingual plane, the average number of bytes per character grows, increasing the space required to store the data and the time required to transmit it.

If your data comprises characters on the multilingual plane beyond those represented in ASCII, UTF-16 is a good choice. While UTF-8 is a varying-length encoding (that is, the number of bytes to represent a character varies depending on the specific character), UTF-16 is a fixed-length encoding – as long as you stick to the multilingual plane, every character occupies exactly two bytes. If your data includes some characters from the 16 supplementary planes, those characters are represented in UTF-16 by two consecutive 16-bit surrogates, four bytes in all – thus making UTF-16 a varying-length encoding in this situation. Consequently, the average number of bytes per character needed to represent text strings in UTF-16 ranges between two and four, inclusive.

Which is better for your environment? The answer to that question depends entirely on the nature of your data. If it is primarily data used in the English-speaking world, then UTF-8 may well be the best choice. But if your data includes a great many characters from non-English speaking cultures, particularly ideographic characters, then UTF-16 is likely to be more efficient.
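The trade-off is easy to quantify for any given body of text; the sample strings in this Python sketch are, of course, arbitrary:

```python
english = "The quick brown fox"       # 19 ASCII characters
chinese = "\u4e2d\u6587\u6587\u672c"  # 4 CJK ideographs on the multilingual plane

# ASCII-heavy text favors UTF-8 (one octet per character)...
assert len(english.encode("utf-8")) == 19
assert len(english.encode("utf-16-be")) == 38

# ...while ideographic text favors UTF-16 (two octets vs. three in UTF-8)
assert len(chinese.encode("utf-8")) == 12
assert len(chinese.encode("utf-16-be")) == 8
```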

One more thing about encoding forms: ISO/IEC 10646 defines two encoding forms of its own, both of them fixed-length. The first, UCS2, represents every character in exactly two bytes. That obviously limits its repertoire of characters to those encoded on the multilingual plane. UCS2 has no mechanism for representing characters on the 16 supplementary planes, because it does not support the concept of surrogates. (That is the only difference between UCS2 and UTF-16, but it’s an important difference.) The other encoding form is UCS4, which represents every character in exactly four bytes; Unicode defines an encoding form called UTF-32 that is identical to UCS4.

The Unicode Consortium has, of course, not rested on its laurels. It continues to release versions of The Unicode Standard, adding new scripts and new characters for existing scripts, refining the designated characteristics of existing and new characters, and specifying elegant mechanisms for defining collations,11 defining normalized forms of character strings that maximize communication between processes, and so forth. Some of these additional mechanisms are published as Unicode Technical Reports (UTR), others are published as Unicode Technical Standards (UTS), and others are incorporated directly in The Unicode Standard as Unicode Standard Annexes (UAX).

Unicode (and UCS) is a resounding success and has proven to be one of the more important underpinnings of the World Wide Web. In particular, the character set on which both HTML and XML are defined is Unicode. Consequently, specifications that are based on, or depend on, XML are also defined in terms of Unicode. This includes W3C specifications discussed in this book, such as XML Schema, XSLT, and XQuery. In Section 17.4, we discuss some of the implications of these relationships.

We think it’s worth noting that, in addition to Unicode’s use in XML-related specifications, the Java Programming Language uses Unicode in the UCS2 encoding form mentioned above as its character set, and a number of Microsoft’s data management and web-related products do the same. While UCS2 is not a precise match for UTF-16, the two encoding forms represent characters on the multilingual plane identically. Java’s and Microsoft’s use of Unicode in UCS2 makes them just that much more useful in developing your XML queries.

17.3.2 W3C Character Model for the World Wide Web

In early 1999, the W3C began development of a character model for the World Wide Web.12 The development process was unusually painful, in large part because of the difficulties in getting “buy-in” from all interested parties. The character model was intended to provide definitions and specifications related to character sets and character strings that can be referenced by other specifications. In particular, the model defined the concept of character string normalization and proposed rules for comparing character strings and identifying specific characters in a string by their position, as well as conventions for representation of URIs.

For almost five years, the Character Model specification went through several Working Draft iterations, during which time the Internationalization Working Group responded to comments from other W3C Working Groups and from the general public. When it became apparent that some parts of the Character Model were so controversial that still more delays would be incurred in its publication, the Internationalization WG decided in early 2004 to split the specification into two documents: Fundamentals and Normalization. Shortly thereafter, a third document was split off from the other two. The three documents that exist at the time of writing are the Fundamentals,13 Normalization,14 and Resource Identifiers.15 Additional documents related to the Character Model may be created in the future.

Fundamentals states:

… the goal of the Character Model for the World Wide Web is to facilitate use of the Web by all people, regardless of their language, script, writing system, and cultural conventions, in accordance with the W3C goal of universal access. One basic prerequisite to achieve this goal is to be able to transmit and process the characters used around the world in a well-defined and well-understood way.

Perhaps the most important requirement specified in the Fundamentals is that all specifications adhering to the Character Model must define text in terms of Unicode characters and not in terms of the glyphs (shapes) that might appear on a screen or on paper. This requirement applies even when the text is originally represented in some (possibly proprietary, possibly standardized) character encoding that is visually oriented (that is, encoded based on the sequence of glyphs that a human reads on paper or on a screen).

Another important specification in Fundamentals is the set of rules for string indexing – that is, for determining the index, or position number, of a given character in a character string. For example, in the character string “XML is popular,” the character in the third position is “L.”
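The point can be sketched in Python; note that Python itself indexes from 0, so the 1-based counting used in the example above must be adjusted, and that indexing must count characters rather than octets:

```python
s = "XML is popular"
# The character "in the third position" (1-based, as in the text) is "L"
assert s[3 - 1] == "L"

# Character positions and byte positions coincide only for pure-ASCII text
t = "st\u00fcck"                    # "stück" in precomposed form
assert len(t) == 5                  # five characters...
assert len(t.encode("utf-8")) == 6  # ...but six UTF-8 octets
```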

While the other two documents have – at the time of writing – achieved Recommendation and Proposed Recommendation status, respectively, the Normalization document is still at the Working Draft stage. The controversy that caused the development of the Character Model to take so long is apparently related to normalization of character strings.

The Normalization spec defines two facilities: character normalization and string identity matching. Character normalization is described as bringing the characters in a string into a well-defined canonical encoding, such as canonically composed or canonically decomposed. Normalization is important because of problems associated with comparison of character strings. Virtually all computer systems compare two character strings in a byte-by-byte style, in which the corresponding bytes in the strings are compared (usually “from left to right,” more properly described as “from most significant to least significant”).

Consider two character strings being compared, in which one contains the German word “stück” in which the “ü” character is represented as a single code position and the other contains the same word, but with the “ü” represented as two code positions (the letter and the umlaut). The first string is encoded as five characters, while the second is encoded as six. The comparison of the strings would initially compare the first characters, which would compare equal (“s” = “s”). The second characters would also compare equal (“t” = “t”). But the third characters would not compare equal (“ü” is not the same as “u”!), nor would the fourth characters (“c” is not the same as a combining umlaut).
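That comparison failure is easy to reproduce in Python:

```python
composed = "st\u00fcck"     # "stück" with precomposed "ü": five codepoints
decomposed = "stu\u0308ck"  # "stück" with "u" + combining diaeresis: six

# The strings render identically, yet a codepoint-by-codepoint
# comparison reports them unequal
assert composed != decomposed
assert (len(composed), len(decomposed)) == (5, 6)
```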

In order to reliably compare two character strings, they must be represented in the same character encoding scheme, but they must also be represented in the same normalization form. This requirement often means that one character string must be converted to the normalization form of the other; to be certain, it is often desirable to convert both strings to the chosen normalization form (this ensures against some characters in a single string being represented in their composed form and others in their decomposed form).

Now, here’s where the controversy arose: The Character Model Normalization draft specifies that normalization is to be performed “early” – that is, by the producers of data, and prohibits the consumers of that data from performing the normalization. In order for comparisons to be dependable, all data producers on the World Wide Web would have to normalize the character strings they produce, and do so in exactly the same normalization form (the Character Model specifies the canonically composed form named NFC by Unicode). Many participants in other W3C Working Groups agreed that, in an ideal world, that would be how things worked. But, they argued, the world is far from ideal and the web is filled with documents of uncertain normalization; it was thus more effective for the consumers of character string data to perform normalization when (and only when!) normalization was required. Because normalization (to a given normalization form, such as NFC) is idempotent, it never causes problems to renormalize a string that has already been normalized.
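Both properties – that normalizing two strings to the same form makes the comparison dependable, and that renormalization is harmless – can be checked directly in Python:

```python
import unicodedata

composed = "st\u00fcck"     # precomposed "stück"
decomposed = "stu\u0308ck"  # decomposed "stück"

# After normalizing both strings to NFC, the comparison succeeds
assert (unicodedata.normalize("NFC", composed)
        == unicodedata.normalize("NFC", decomposed))

# NFC is idempotent: renormalizing an already-normalized string changes nothing
once = unicodedata.normalize("NFC", decomposed)
twice = unicodedata.normalize("NFC", once)
assert once == twice
```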

The controversy rages on, with a lot of creativity being used to attempt to find solutions (sometimes in the form of rather clever definitions) that will allow all parties’ requirements to be satisfied.

The other facility defined in the Normalization spec is string identity matching. The spec defines the term to mean that two strings being matched for string identity have no user-identifiable distinctions. For example, the strings “The Thing” and “The thing” have a user-identifiable distinction: the first contains a capital “T,” while the second has a lowercase “t” in the corresponding position. The spec defines the concept such that strings do not match when they differ in case or accentuation, but do match when they differ only in nonsemantically significant ways, such as character encoding, use of character escapes (of potentially different kinds), or use of precomposed vs. decomposed character sequences.
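As a rough illustration only – the identity_match function below is our own sketch, approximating the concept by comparing NFC forms; it is not the spec’s definition – the behavior might look like this in Python:

```python
import unicodedata

def identity_match(a: str, b: str) -> bool:
    # Sketch: treat two strings as identical when their canonically
    # composed (NFC) forms are codepoint-for-codepoint equal
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Case differences are user-identifiable, so these strings do not match
assert not identity_match("The Thing", "The thing")

# Precomposed vs. decomposed "stück" is not user-identifiable, so these match
assert identity_match("st\u00fcck", "stu\u0308ck")
```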

We are personally quite impressed with the Character Model (even though we disagree with the Normalization draft’s specification that normalization is to be performed early). The Character Model provides some perspectives, definitions, guidelines, and policies that are proving very helpful in making the World Wide Web truly universal.

17.4 Internationalization Implications: XPath, XQuery, and SQL/XML

At the end of Section 17.3.1, we told you that the choice of using Unicode as the character set for XML would have implications on all specifications and products that use or depend on XML. That certainly applies to the technologies we’ve discussed in this book for querying XML, obviously including XPath16 and XQuery;17 perhaps a little less obviously, SQL/XML18 is also affected. In this section, we look at some of those implications.

The first, and most obvious, implication is that XPath and XQuery expressions must be able to specify data used for comparisons and matching in XML. For example, in the XPath /movies/movie[title=“Starship Troopers”], the title element child of each movie element is, of course, XML and thus encoded in Unicode. Consequently, the comparison of the value of that title element with the literal “Starship Troopers” is done using Unicode comparison rules. As a result, the literal itself should also be encoded in Unicode. The W3C Working Groups responsible for XPath and XQuery made the obvious decision that the character set for both XPath and XQuery is Unicode. (Note that, in specifications for all Unicode-based languages, including XML, the specific Unicode encoding form – e.g., UTF-8, UTF-16, or anything else – is irrelevant.19 Implementations of those languages must, of course, deal with the actual encoding forms.)

A second, rather more subtle, implication relates to a few of the functions described in the Functions and Operators spec.20 The functions in question (fn:matches, fn:replace, and fn:tokenize) perform matching operations using regular expressions. The Unicode Consortium has published a UTS21 that describes the issues related to regular expression matching when the character set in use is Unicode. This UTS is primarily concerned with regular expression matching implementations, so one of the principal issues is the very large number of characters encoded by Unicode (especially compared with a standard like ASCII, which defines fewer than 100 characters).

UTS #18 defines three levels of Unicode support:

• Level 1: Basic Unicode Support requires that the regular expression (regex) engine recognize and process Unicode characters instead of bytes (which, because it works fine with ASCII, is the – inappropriate – technique used by some regex engines).

• Level 2: Extended Unicode Support requires that the regex engine recognize grapheme clusters (that is, it recognizes Unicode character sequences that have been defined to correspond to the marks that appear on paper or on a screen that most human readers would think of as a character).

• Level 3: Tailored Support is provided by regex engines that provide application-specific treatment of characters, such as that assumed by specific countries, languages, or scripts.

Among the requirements placed on regular expression engines by UTS #18 are the following.

• Some mechanism for specifying arbitrary Unicode characters must be provided (such as the notation used by XML – a numeric character reference like “&#xHHHH;” that identifies a character by its hexadecimal codepoint).

• The engine must provide a way to reference whole categories of characters using the Unicode character properties (e.g., “letter,” “digit,” or “whitespace”). Minimally, the engine must support at least these properties: General_Category, Script, Alphabetic, Uppercase, Lowercase, White_Space, Noncharacter_Code_Point, Default_Ignorable_Code_Point, ANY, ASCII, and ASSIGNED. There are 40 General_Category properties, including things such as Letter, Mark, Decimal Digit Number, Final Punctuation, and Surrogate.
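Python’s standard re module does not support \p{...} property escapes (the widely used third-party regex module does), but the standard unicodedata module exposes the General_Category property directly:

```python
import unicodedata

assert unicodedata.category("A") == "Lu"       # Letter, uppercase
assert unicodedata.category("a") == "Ll"       # Letter, lowercase
assert unicodedata.category("5") == "Nd"       # Number, decimal digit
assert unicodedata.category("\u0308") == "Mn"  # Mark, nonspacing
assert unicodedata.category(" ") == "Zs"       # Separator, space
```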

A third implication of Unicode on XPath and XQuery is the definition of functions and operations either focused on Unicode semantics or influenced by Unicode semantics. The XQuery Functions and Operators specification defines a number of such functions: fn:codepoints-to-string, fn:string-to-codepoints, fn:codepoint-equal, fn:normalize-unicode, fn:upper-case, fn:lower-case, fn:contains, fn:starts-with, fn:ends-with, fn:substring-before, fn:substring-after, fn:matches, fn:replace, and fn:tokenize. In addition, all string comparison and ordering operations are dependent on the Unicode Collation Algorithm.22
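Several of these functions have close analogues in general-purpose languages. The Python counterparts shown here (ord, chr, str.upper, and unicodedata.normalize) are rough equivalents, not the XQuery functions themselves:

```python
import unicodedata

# fn:string-to-codepoints / fn:codepoints-to-string roughly
# correspond to ord() and chr()
codepoints = [ord(c) for c in "XML"]
assert codepoints == [88, 77, 76]
assert "".join(chr(cp) for cp in codepoints) == "XML"

# fn:normalize-unicode normalizes to NFC by default
assert unicodedata.normalize("NFC", "stu\u0308ck") == "st\u00fcck"

# fn:upper-case applies the full Unicode case mappings, not just ASCII ones
assert "st\u00fcck".upper() == "ST\u00dcCK"
```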

The effects of Unicode on SQL/XML exist in part because of the relationship between SQL/XML and XQuery (see Chapter 16, “XML-Derived Markup Languages,” for more information). There is a second source of those effects, though, and that is the recent dependence on Unicode in SQL itself. SQL has been “internationalized” since its 1992 edition,23 at least in its recognition that ASCII (or even Latin 1) was not the only character set in use by database system customers and that culturally-appropriate collations were required by those customers. At that time, however, the first version of Unicode had just been published and it was far from certain that it would gather sufficient support from the computer industry. Consequently, SQL-92 did not explicitly recognize Unicode as a source of a model for character processing.

Later versions of SQL increasingly took notice of Unicode. Finally, in SQL:2003,24 Unicode was adopted as the model by which the SQL specifications talked about characters. Many SQL implementations explicitly use Unicode as the foundation for supporting databases using characters beyond the range included in the familiar Latin 1 character set.

17.5 Chapter Summary

In this chapter, we’ve introduced the subject of internationalization and the W3C’s support for the concept. We wrote at some length about the universal character set known as Unicode (and its de jure standard twin, ISO 10646), its success in the marketplace, and its use as the foundation for the computer industry’s efforts to become friendlier to users all over the world.

You also read about the W3C’s Internationalization Working Group’s still-emerging Character Model for the World Wide Web, including both the two component documents that have reached (or nearly reached) Recommendation status and the third document that is still at Working Draft stage, in part because of lingering controversy over its requirements. And, finally, we gave you a peek at several of the ways in which Unicode (and, by extension, the Character Model) has influenced XPath, XQuery, and even SQL/XML.


1Extensible Markup Language (XML) 1.0, third edition, W3C Recommendation (Cambridge, MA: World Wide Web Consortium, 2004). Available at: http://www.w3.org/TR/REC-xml.

2Software developers who are concerned with these languages may be interested in this book: Ken Lunde, CJKV Information Processing (Sebastopol, CA: O’Reilly, 1998).

3While we don’t have a problem with such decisions, we do believe that virtually all software should be internationalized. After all, one never knows what countries will be the next economic giants!

4W3C Internationalization Core Working Group home page, available at: http://www.w3.org/International/core/.

5The Unicode Consortium’s website is found at http://www.unicode.org.

6The Unicode Consortium. The Unicode Standard, Version 4.1.0, defined by: The Unicode Standard, Version 4.0 (Boston, MA: Addison-Wesley, 2003), as amended by Unicode 4.0.1 (http://www.unicode.org/versions/Unicode4.0.1) and by Unicode 4.1.0 (http://www.unicode.org/versions/Unicode4.1.0).

7For many years, it was the practice of font publishers to build fonts for different languages using the same character codes. For example, a font intended for use in western Europe or the Americas would probably be based on Latin 1 (aka ISO/IEC 8859-1). But a font for use in Israel would use the same character codes (that is, the numbers that identify each character) for Hebrew characters.

8ISO/IEC 10646:2003, Information Technology – Universal Multi-Octet Coded Character Set (UCS) – Part 1: Architecture and Basic Multilingual Plane (Geneva, Switzerland: International Organization for Standardization, 2003).

9ISO/IEC 8859-1:1998, Information Technology – 8-bit Single-byte Coded Graphic Character Sets – Part 1: Latin Alphabet No. 1 (Geneva, Switzerland: International Organization for Standardization, 1998).

10ANSI/INCITS 4-1986(R1997), Coded Character Sets – 7-bit American National Standard Code for Information Interchange (7-bit ASCII) (New York: American National Standards Institute, 1986).

11A collation is a mechanism for defining comparison and ordering of character strings. A common example of collations is “use the numeric value of the characters to compare characters.” Many examples of culturally sensitive collations exist that allow sorting according to the rules of French, German, Arabic, or Japanese.

12Character Model for the World Wide Web, W3C Working Draft (Cambridge, MA: World Wide Web Consortium, 1999). Available at: http://www.w3.org/TR/1999/WD-charmod-19990225.

13Character Model for the World Wide Web 1.0: Fundamentals, W3C Recommendation (Cambridge, MA: World Wide Web Consortium, 2005). Available at: http://www.w3.org/TR/charmod/.

14Character Model for the World Wide Web 1.0: Normalization, W3C Working Draft (Cambridge, MA: World Wide Web Consortium, 2005). Available at: http://www.w3.org/TR/charmod-norm/.

15Character Model for the World Wide Web 1.0: Resource Identifiers, W3C Candidate Recommendation (Cambridge, MA: World Wide Web Consortium, 2004). Available at: http://www.w3.org/TR/charmod-resid/.

16XML Path Language (XPath) 2.0, W3C Candidate Recommendation (Cambridge, MA: World Wide Web Consortium, 2005). Available at: http://www.w3.org/TR/xpath20/.

17XQuery 1.0: An XML Query Language, W3C Candidate Recommendation (Cambridge, MA: World Wide Web Consortium, 2005). Available at: http://www.w3.org/TR/xquery/.

18ISO/IEC 9075-14:2006 (planned), Information Technology – Database Languages – SQL – Part 14: XML-Related Specifications (SQL/XML) (Geneva, Switzerland: International Organization for Standardization, 2006).

19The specific Unicode encoding form used by an instance XML document being searched and literals used in queries on that document is irrelevant because both the XML document and the query literals are represented as XQuery Data Model instances. The Data Model represents character data in “Unicode,” but not in a specified encoding form. A common implementation technique is to transform all such character data into UTF-32 (equivalently, to UCS-4) as the most generalized encoding form.

20XQuery 1.0 and XPath 2.0 Functions and Operators, W3C Candidate Recommendation, (Cambridge, MA: World Wide Web Consortium, 2005). Available at: http://www.w3.org/TR/xpath-functions/.

21Unicode Technical Standard #18, Unicode Regular Expressions, The Unicode Consortium (2005). Available at: http://www.unicode.org/reports/tr18/.

22Unicode Technical Standard #10, Unicode Collation Algorithm (The Unicode Consortium, 2005). Available at: http://www.unicode.org/reports/tr10/.

23ISO/IEC 9075:1992, Information Technology – Database Languages – SQL (Geneva, Switzerland: International Organization for Standardization, 1992).

24ISO/IEC 9075-*:2003, Information Technology – Database Languages – SQL (all parts) (Geneva, Switzerland: International Organization for Standardization, 2003).
