Chapter 14. Multilingual Character Sets and Unicode

We live on a planet on which many languages are spoken. I can walk out my front door in Brooklyn on any given day and hear people conversing in French, Creole, Hebrew, Arabic, Spanish, and languages I don’t even recognize. And the Internet is even more diverse than Brooklyn. A local doctor’s office that sets up a storefront on the Web to sell vitamins may soon find itself shipping to customers whose native language is Chinese, Gujarati, Turkish, German, Portuguese, or something else. There’s no such thing as a local business on the Internet.

However, the first computers and the first programming languages were mostly designed by English-speaking programmers in countries where English was the native language. These programmers designed character sets that worked well for English text, though not much else. The preeminent such set is ASCII. Since ASCII is a seven-bit character set, each ASCII character can easily be represented as a single byte, signed or unsigned. Thus, it’s natural for ASCII-based programming languages to equate the character data type with the byte data type. In these languages, such as C, the same operations that read and write bytes also read and write characters.

Unfortunately, ASCII is inadequate for almost all non-English languages. It contains no cedillas, umlauts, betas, thorns, or any of the other thousands of non-English characters that are used to read and write text around the world. Fairly shortly after the development of ASCII, there was an explosion of extended character sets around the world, each of which encoded the basic ASCII characters as well as the additional characters needed for another language like Greek, Turkish, Arabic, Chinese, Japanese, or Russian. Many of these character sets are still used today, and much existing data is encoded in them.

However, these character sets are still inadequate for many needs. For one thing, most assume that you only want to encode English plus one other language. This makes it difficult for a Russian classicist to write a commentary on an ancient Greek text, for example. Furthermore, documents are limited by their character sets. Email sent from Morocco may become illegible in India if the sender is using an Arabic character set but the recipient is using Devanagari.

Unicode is an international effort to provide a single character set that everyone can use. Unicode supports the characters needed for English, Arabic, Cyrillic, Greek, Devanagari, and many others. Unicode isn’t perfect. There are some omissions, especially in the ideographic character sets for Chinese and Japanese, but it is the most comprehensive character set yet devised for all the languages of planet Earth.

Java is one of the first programming languages to explicitly address the need for non-English text. It does this by adopting Unicode as its native character set. All Java chars and strings are given in Unicode. However, since there’s also a lot of non-Unicode legacy text in the world, in a dizzying array of encodings, Java also provides the classes you need to read and write text in these encodings as well.
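To make the distinction between Unicode strings and encoded bytes concrete, here is a minimal sketch that writes the same string out in two encodings and reads it back. It assumes a Java version recent enough to have `java.nio.charset.StandardCharsets`; the class name `EncodingRoundTrip` and the helper `roundTrip` are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingRoundTrip {
    // Encodes text to bytes in the given charset, then decodes those bytes back.
    static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00E9"; // "café"; the escape keeps this source file pure ASCII

        // The same four characters take a different number of bytes in each encoding.
        System.out.println(text.getBytes(StandardCharsets.ISO_8859_1).length); // 4
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);      // 5

        // Decoding with the charset used for encoding recovers the original string.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

Inside the program the string is always Unicode; an encoding only comes into play at the moment text is converted to or from bytes.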

Unicode

Unicode is Java’s native character set. Each Unicode character is a two-byte, unsigned number with a value between 0 and 65,535. This provides enough space for characters from all the world’s alphabetic scripts and the most common characters from the ideographic scripts of Chinese and Japanese. The current version of Unicode (2.1) defines 38,887 different characters from many languages, including English, Russian, Arabic, Hebrew, Greek, Thai, Korean, and Sanskrit. The most common ideographic characters from Japanese and Chinese are also included. However, Chinese alone contains over 80,000 different ideograms, so it’s impossible to include them all in a two-byte set. A four-byte Universal Character Set (UCS) that will include the full Chinese and Japanese scripts is under development. Java does not yet support UCS.
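The two-byte, unsigned nature of a Java `char` is easy to check from code. The sketch below prints the smallest and largest `char` values as integers; the class name `CharRangeDemo` and the helper `valueOf` are invented for this example, and `Character.SIZE` assumes Java 5 or later.

```java
class CharRangeDemo {
    // A char widens to int without sign extension, so the result is never negative.
    static int valueOf(char c) {
        return c;
    }

    public static void main(String[] args) {
        System.out.println(valueOf(Character.MIN_VALUE)); // 0
        System.out.println(valueOf(Character.MAX_VALUE)); // 65535
        System.out.println(Character.SIZE);               // 16 (bits per char)
        System.out.println(valueOf('A'));                 // 65
    }
}
```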

The first 128 Unicode characters (characters 0 through 127) are identical to the ASCII character set. 32 is the ASCII space; therefore, 32 is the Unicode space. 33 is the ASCII exclamation point, so 33 is the Unicode exclamation point, and so on. Table 2.1, in Appendix B, shows this character set. The next 128 Unicode characters (characters 128 through 255) have the same values as the equivalent characters in the Latin-1 character set defined by ISO standard 8859-1. Latin-1, a slight variation of which is used by Windows, adds the various accented characters, umlauts, cedillas, upside-down question marks, and other characters needed to write text in most Western European languages. Table 2.2 shows these characters. The first 128 characters in Latin-1 are identical to the ASCII character set.
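One way to see this overlap is to decode single bytes as ASCII and as Latin-1 and compare each result with the byte's own value. This is a sketch rather than anything the Java class library requires you to write; `FirstBlocksDemo` and its methods are hypothetical names, and `StandardCharsets` assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class FirstBlocksDemo {
    // Decodes one byte as US-ASCII and returns the Unicode value of the result.
    static int decodeAscii(int byteValue) {
        byte[] one = { (byte) byteValue };
        return new String(one, StandardCharsets.US_ASCII).charAt(0);
    }

    // Decodes one byte as ISO 8859-1 (Latin-1) and returns the Unicode value.
    static int decodeLatin1(int byteValue) {
        byte[] one = { (byte) byteValue };
        return new String(one, StandardCharsets.ISO_8859_1).charAt(0);
    }

    // Returns true if every Latin-1 byte value decodes to the same Unicode value.
    static boolean latin1MatchesUnicode() {
        for (int b = 0; b <= 255; b++) {
            if (decodeLatin1(b) != b) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println((int) ' ');              // 32, the space
        System.out.println(decodeAscii(33));        // 33, the exclamation point
        System.out.println(decodeLatin1(233));      // 233, e with acute accent
        System.out.println(latin1MatchesUnicode()); // true
    }
}
```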

Values beyond 255 encode characters from various other character sets. Where possible, character blocks describing a particular group of characters map onto established encodings for that set of characters by simple transposition. For instance, Unicode characters 884 through 1011 encode the Greek alphabet and associated characters like the Greek question mark (;).[27] The letters in this block are a direct transposition by 720 of the Greek letters in the upper half (characters 128 through 255) of the ISO 8859-7 character set, which is in turn based on the Greek national standard ELOT 928. For example, the small letter delta, δ, ISO 8859-7 character 228, is Unicode character 948. A small epsilon, ε, ISO 8859-7 character 229, is Unicode character 949. In general, the Unicode value for a Greek letter equals the ISO 8859-7 value for that letter plus 720. Other character sets are included in Unicode in a similar fashion whenever possible.[28]
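A transposition like this can be checked empirically by decoding ISO 8859-7 bytes and subtracting the original byte value from the resulting Unicode value. The following sketch assumes the JDK in use ships an ISO-8859-7 decoder, which is not one of the charsets every Java platform is required to provide; `GreekOffsetDemo` and `toUnicode` are names invented here.

```java
import java.nio.charset.Charset;

class GreekOffsetDemo {
    // Decodes one ISO 8859-7 byte and returns the Unicode value of the result.
    // Charset.forName throws UnsupportedCharsetException if the charset is missing.
    static int toUnicode(int iso8859_7Value) {
        Charset greek = Charset.forName("ISO-8859-7");
        byte[] one = { (byte) iso8859_7Value };
        return new String(one, greek).charAt(0);
    }

    public static void main(String[] args) {
        int delta = 228;   // small letter delta in ISO 8859-7
        int epsilon = 229; // small letter epsilon in ISO 8859-7

        // Both lines print the same number: the offset between the two encodings.
        System.out.println(toUnicode(delta) - delta);
        System.out.println(toUnicode(epsilon) - epsilon);
    }
}
```

The sketch checks letters only. Some punctuation and symbols in the upper half of ISO 8859-7, such as the pound sign, also exist in Latin-1 and keep their Latin-1 values in Unicode, so they do not follow the letter offset.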

NextStep, BeOS, MacOS X Server, Bell Labs’ Plan 9, and Windows NT 4.0 all support Unicode to some extent. Unicode support in MacOS and Windows 98 is more nascent, but it’s coming. Application software is a little slower to appear, but Microsoft Word 97 and 98, Netscape Navigator 4.0, and Internet Explorer 4.0 all support Unicode. The big hold-up on most systems is fonts and input methods. Windows NT 5.0 will include fonts covering most of the defined Unicode characters as well as input methods for most major languages.



[27] Indeed, the Greek question mark is nearly identical to a Latin semicolon; this is not a mistranslation of the character.

[28] As much as I’d like to include complete tables for all Unicode characters, if I did so, this book would be little more than that table. For complete lists of all the Unicode characters and associated glyphs, the canonical reference is The Unicode Standard, Version 2.0, by the Unicode Consortium, ISBN 0-201-48345-9. Online versions of the character tables can be found at http://unicode.org/charts/.
