Unicode Escapes

Currently, there isn’t a large installed base of Unicode text editors. There’s an even smaller installed base of machines with full Unicode fonts installed. Therefore, it’s essential that all valid Java programs can be written using nothing more than ASCII characters.

All Java keywords and operators as well as the names of all the classes, methods, and fields in the core API may be written in pure ASCII. This is by deliberate design on the part of JavaSoft. However, Unicode characters are explicitly allowed in comments, string and char literals, and identifiers. The following, the opening line from Homer’s Odyssey, should be legal Java:

/* Οδυσσεια */
String αρχη =
 "Άνδρα μοι "
 + "έννεπε, "
 + "Μουσα, "
 + " ος μάλα πολλα";

To enable statements like that in Java source, non-ASCII characters are embedded through Unicode escape sequences. The escape sequence for a character is a backslash (\) followed by a lowercase u, followed by the four-digit hexadecimal code for the character. For example:

char tab = '\u0009';
char softHyphen = '\u00AD';
char sigma = '\u03C3';
char squareKeesu = '\u30B9';
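As a minimal sketch of how the four hex digits relate to the character itself, the following formats a char as its escape text. The class and method names (EscapeAnatomy, toEscape) are hypothetical, chosen only for illustration.

```java
// Illustrative only: EscapeAnatomy and toEscape are made-up names.
class EscapeAnatomy {
    // Builds the six-character escape text for one char:
    // a backslash, a lowercase u, and four uppercase hex digits.
    static String toEscape(char c) {
        return String.format("\\u%04X", (int) c);
    }

    public static void main(String[] args) {
        char sigma = '\u03C3';
        // The escape in the literal above and the numeric code agree.
        System.out.println((int) sigma == 0x03C3);
        // Print the escape text for sigma and for a tab.
        System.out.println(toEscape(sigma));
        System.out.println(toEscape('\t'));
    }
}
```

The `%04X` format pads short codes with leading zeros, so the result always has exactly four hex digits, as the escape syntax requires.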

Using Unicode escapes, the opening line from Homer’s Odyssey would be rendered as:

/* \u039F\u03B4\u03C5\u03C3\u03C3\u03B5\u03B9\u03B1 */
String \u03B1\u03C1\u03C7\u03B7 =
 "\u0386\u03BD\u03B4\u03C1\u03B1 \u03BC\u03BF\u03B9 "
 + "\u03AD\u03BD\u03BD\u03B5\u03C0\u03B5, "
 + "\u039C\u03BF\u03C5\u03C3\u03B1, "
 + " \u03BF\u03C2 \u03BC\u03AC\u03BB\u03B1 \u03C0\u03BF\u03BB\u03BB\u03B1";

Obviously, this is horribly inconvenient for anything more than an occasional non-ASCII character.

Many Java compilers assume that source files are written in ASCII and that the only Unicode characters present are Unicode escapes. During a single-pass preprocessing phase, the compiler converts each raw ASCII character or Unicode escape sequence to a two-byte Unicode character that it stores in memory. Only after preprocessing is complete and the ASCII file has been converted to in-memory Unicode is the file actually compiled. Some compilers and runtimes will also compile the upper 128 characters of the ISO Latin-1 character set; others will not. Worse yet, some Java virtual machines can compile files containing non-ASCII ISO Latin-1 characters but can’t run the files they’ve compiled. For safety’s sake and maximum portability, you should escape all non-ASCII characters.
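The preprocessing phase described above can be sketched roughly as follows. This is not the code of any real compiler; the names EscapePreprocessor and decode are hypothetical. It assumes the input is pure ASCII, and it follows the language rule that a backslash preceded by another backslash does not start an escape.

```java
// Illustrative sketch of a single-pass escape translator; names are made up.
class EscapePreprocessor {
    // Converts ASCII source text into in-memory Unicode by replacing each
    // backslash-u escape with the two-byte char it names.
    static String decode(String ascii) {
        StringBuilder out = new StringBuilder(ascii.length());
        int i = 0;
        while (i < ascii.length()) {
            char c = ascii.charAt(i);
            boolean hasNext = i + 1 < ascii.length();
            if (c == '\\' && hasNext && ascii.charAt(i + 1) == 'u') {
                int j = i + 1;
                // The language allows one or more u's after the backslash.
                while (j < ascii.length() && ascii.charAt(j) == 'u') j++;
                if (j + 4 > ascii.length()) {
                    throw new IllegalArgumentException("truncated escape at index " + i);
                }
                int code = 0;
                for (int k = 0; k < 4; k++) {
                    int digit = Character.digit(ascii.charAt(j + k), 16);
                    if (digit < 0) {
                        throw new IllegalArgumentException("bad hex digit at index " + (j + k));
                    }
                    code = code * 16 + digit;
                }
                out.append((char) code);
                i = j + 4;
            } else if (c == '\\' && hasNext) {
                // A backslash followed by anything else (including another
                // backslash) is copied as a pair, so an escaped backslash
                // followed by u is never mistaken for a Unicode escape.
                out.append(c).append(ascii.charAt(i + 1));
                i += 2;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("\\u03C3");
        // Print the numeric code of the decoded char rather than the char itself,
        // so the output does not depend on the console's encoding.
        System.out.println(Integer.toHexString(decoded.charAt(0)));
    }
}
```

Only after a pass like this would the remaining compilation stages see the source, which is why escapes work equally well in identifiers, comments, and literals.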

Versions 1.1 and later of Sun’s javac compiler assume that a .java file is written in the platform’s default encoding, which is Latin-1 on Solaris and Windows and MacRoman on the Mac. However, this produces incorrect results on Windows, because Windows does not use true Latin-1 but a modified version that includes fewer control characters and more printing characters.

Text editors that work with non-ASCII character sets like MacRoman, Arabic, or Big-5 Chinese can integrate with existing Java compilers by providing a preprocessing phase where the natively encoded data is translated to Unicode-escaped ASCII before being passed to Sun’s javac compiler. Alternatively, they can hand off the translation work to javac (1.1 and later) by using its -encoding flag. For example, to specify that the file MyClass.java is written in the ISO 8859-9 character set (essentially Latin-1 with the Turkish characters ş, Ş, ı, İ, Ğ, and ğ replacing the Icelandic characters þ, Þ, ý, Ý, Ð, and ð) you would type:

% javac -encoding 8859_9 MyClass.java
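For editors that take the other route and do the translation themselves, the preprocessing step might look something like the sketch below. The names NativeToEscapedAscii and toEscapedAscii are hypothetical, and the sketch assumes the runtime supports the charset named ISO-8859-9, which `Charset.forName` will reject with an exception if it does not.

```java
import java.nio.charset.Charset;

// Illustrative sketch of an editor-side preprocessing step; names are made up.
class NativeToEscapedAscii {
    // Decodes natively encoded bytes, then rewrites every non-ASCII char
    // as a Unicode escape so the result is pure ASCII.
    static String toEscapedAscii(byte[] nativeBytes, String encodingName) {
        String text = new String(nativeBytes, Charset.forName(encodingName));
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);                                   // ASCII passes through
            } else {
                out.append(String.format("\\u%04X", (int) c));   // everything else is escaped
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The same byte means different characters in different encodings:
        // 0xFE is a Turkish letter in ISO 8859-9 but an Icelandic one in Latin-1.
        byte[] data = { (byte) 0xFE };
        System.out.println(toEscapedAscii(data, "ISO-8859-9"));
        System.out.println(toEscapedAscii(data, "ISO-8859-1"));
    }
}
```

The two calls in main yield different escapes for the same byte, which is exactly why the compiler, or the editor acting on its behalf, has to be told which encoding the file uses.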

Table 2.4 lists the encodings that Java 1.1 understands.
