Currently, there isn’t a large installed base of Unicode text editors. There’s an even smaller installed base of machines with full Unicode fonts installed. Therefore, it’s essential that all valid Java programs can be written using nothing more than ASCII characters.
All Java keywords and operators, as well as the names of all the classes, methods, and fields in the core API, may be written in pure ASCII. This is by deliberate design on the part of JavaSoft. However, Unicode characters are explicitly allowed in comments, string and char literals, and identifiers. The following, the opening line from Homer’s Odyssey, should be legal Java:
/* Οδυσσεια */ String αρχη = "Άνδρα μοι " + "έννεπε, " + "Μουσα, " + " ος μάλα πολλα";
To enable statements like that in Java source, non-ASCII characters are embedded through Unicode escape sequences. The escape sequence for a character is a backslash (\) followed by a lowercase u, followed by the four-digit hexadecimal code for the character. For example:
char tab = '\u0009';
char softHyphen = '\u00AD';
char sigma = '\u03C3';
char squareKeesu = '\u30B9';
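As a small illustration of the escape format, the following sketch builds the escape for a given char and prints the numeric value behind one of the literals above. The class name EscapeDemo and the method toEscape are made up for this example; they are not part of the core API.

```java
public class EscapeDemo {

    // Hypothetical helper: formats one char as a backslash, a lowercase u,
    // and four uppercase hexadecimal digits.
    static String toEscape(char c) {
        return String.format("\\u%04X", (int) c);
    }

    public static void main(String[] args) {
        char sigma = '\u03C3';                 // Greek small letter sigma
        System.out.println((int) sigma);       // the numeric code behind the escape
        System.out.println(toEscape(sigma));   // the escape as six ASCII characters
        System.out.println(toEscape('\t'));    // the tab character, code 0009
    }
}
```

Because the escape is resolved before the rest of the source is parsed, the escaped literal and the raw character produce the same char value.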
Using Unicode escapes, the opening line from Homer’s Odyssey would be rendered as:
/* \u039F\u03B4\u03C5\u03C3\u03C3\u03B5\u03B9\u03B1 */
String \u03B1\u03C1\u03C7\u03B7 = "\u0386\u03BD\u03B4\u03C1\u03B1 \u03BC\u03BF\u03B9 "
 + "\u03AD\u03BD\u03BD\u03B5\u03C0\u03B5, "
 + "\u039C\u03BF\u03C5\u03C3\u03B1, "
 + " \u03BF\u03C2 \u03BC\u03AC\u03BB\u03B1 \u03C0\u03BF\u03BB\u03BB\u03B1";
Obviously, this is horribly inconvenient for anything more than an occasional non-ASCII character.
Many Java compilers assume that source files are written in ASCII and that the only Unicode characters present are Unicode escapes. During a single-pass preprocessing phase, the compiler converts each raw ASCII character or Unicode escape sequence to a two-byte Unicode character that it stores in memory. Only after preprocessing is complete and the ASCII file has been converted to in-memory Unicode is the file actually compiled. Some compilers and runtimes will also accept the upper 128 characters of the ISO Latin-1 character set; others will not. Worse yet, some Java virtual machines can compile files containing non-ASCII ISO Latin-1 characters but can’t run the files they’ve compiled. For safety’s sake and maximum portability, you should escape all non-ASCII characters.
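The preprocessing pass described above can be sketched roughly as follows. This is a simplified illustration, not the code of any real compiler: the class UnicodeUnescaper and its unescape method are hypothetical names, and the sketch assumes two details of the language rules, namely that a backslash starts an escape only when an even number of backslashes precedes it, and that more than one u may follow the backslash.

```java
public class UnicodeUnescaper {

    // Hypothetical single-pass translation of Unicode escapes into chars.
    // All other characters are copied through unchanged.
    static String unescape(String src) {
        StringBuilder out = new StringBuilder(src.length());
        int backslashRun = 0; // contiguous backslashes copied just before position i
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            boolean startsEscape = c == '\\'
                    && backslashRun % 2 == 0
                    && i + 1 < src.length()
                    && src.charAt(i + 1) == 'u';
            if (startsEscape) {
                int j = i + 1;
                while (j < src.length() && src.charAt(j) == 'u') {
                    j++; // skip every u that follows the backslash
                }
                if (j + 4 > src.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                int value = 0;
                for (int k = j; k < j + 4; k++) {
                    int digit = Character.digit(src.charAt(k), 16);
                    if (digit < 0) {
                        throw new IllegalArgumentException("Bad hex digit at index " + k);
                    }
                    value = value * 16 + digit;
                }
                out.append((char) value);
                i = j + 4;
                backslashRun = 0;
            } else {
                backslashRun = (c == '\\') ? backslashRun + 1 : 0;
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

The result is an in-memory string of two-byte chars, which is the form a compiler would then tokenize and parse.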
Versions 1.1 and later of Sun’s javac compiler assume a .java file is written in the platform’s default encoding, which is Latin-1 on Solaris and Windows and MacRoman on the Mac. However, this produces incorrect results on Windows, because Windows does not use true Latin-1 but a modified version that includes fewer control characters and more printing characters.
Text editors that work with non-ASCII character sets like MacRoman, Arabic, or Big-5 Chinese can integrate with existing Java compilers by providing a preprocessing phase in which the natively encoded data is translated to Unicode-escaped ASCII before being passed to Sun’s javac compiler. Alternatively, they can hand off the translation work to javac (1.1 and later) by using its -encoding flag. For example, to specify that the file MyClass.java is written in the ISO 8859-9 character set (essentially Latin-1 with the Turkish characters ş, Ş, ı, İ, Ğ, and ğ replacing the Icelandic characters þ, Þ, ý, Ý, Ð, and ð) you would type:
% javac -encoding 8859_9 MyClass.java
Table 2.4 lists the encodings that Java 1.1 understands.
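For editors that take the preprocessing route instead, the translation from natively encoded data to Unicode-escaped ASCII might look something like the sketch below. The class NativeToAscii and its methods are hypothetical names chosen for this example. The sketch assumes the Java runtime recognizes the encoding name it is given (written here in the form ISO-8859-9), and it makes no attempt to treat backslashes already present in the input specially.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class NativeToAscii {

    // Copies ASCII characters through unchanged and writes every other
    // char as a six-character Unicode escape.
    static void escape(Reader in, Writer out) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (c < 0x80) {
                out.write(c);
            } else {
                out.write(String.format("\\u%04X", c));
            }
        }
    }

    // Decodes the bytes using the named encoding, then escapes the result.
    static String escape(byte[] data, String encoding) throws IOException {
        Reader in = new InputStreamReader(
                new ByteArrayInputStream(data), Charset.forName(encoding));
        StringWriter out = new StringWriter();
        escape(in, out);
        return out.toString();
    }

    // Usage (assumed): java NativeToAscii ISO-8859-9 MyClass.java
    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[1]));
        System.out.print(escape(data, args[0]));
    }
}
```

The output contains only ASCII characters, so it can be handed to a compiler that knows nothing about the original encoding.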