ASCII, Unicode, and the Universal Character Set

The characters in a document are stored as numeric codes. The most common code set today is the American Standard Code for Information Interchange (ASCII). ASCII is a 7-bit code, so its codes extend from 0 to 127; the 8-bit extended sets built on top of it run from 0 to 255 to fill a single byte. For example, the ASCII code for "A" is 65, the ASCII code for "B" is 66, and so on.
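As a quick check of the codes quoted above, here is a minimal Python sketch that looks up the numeric code behind a character and converts a code back into a character. The variable names are only illustrative.

```python
# Look up the numeric code stored for a character, and go back again.
code_a = ord("A")
code_b = ord("B")
char_67 = chr(67)

# The codes behind each character of a short string.
codes = [ord(ch) for ch in "XML"]

print(code_a, code_b, char_67, codes)  # prints: 65 66 C [88, 77, 76]
```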

On the other hand, the World Wide Web is just that today: worldwide. Plenty of scripts are not handled by ASCII, including Bengali, Armenian, Hebrew, Thai, Tibetan, Japanese Katakana, Arabic, and Cyrillic, among others.

For that reason, the default character set specified for XML by W3C is Unicode, not ASCII. Unicode was designed around 2-byte codes, which extend from 0 to 65,535 rather than just 0 to 255. (To make things easier, the Unicode codes 0 to 127 correspond to the ASCII codes, and the codes 128 to 255 match the ISO 8859-1, or Latin-1, extension of ASCII.) Unicode can therefore include many of the symbols commonly used in worldwide character and ideograph sets. You can find more on Unicode at http://www.unicode.org.

Only about 40,000 Unicode codes are reserved at this point. About 20,000 of them are used for Han ideographs (although more than 80,000 such ideographs are defined), and about 11,000 are used for Korean Hangul syllables.

In practice, Unicode, like many parts of the XML technology, is not fully supported on most platforms today. Windows 95/98 does not offer full support for Unicode, although Windows NT and Windows 2000 come much closer (and XML Spy lets you use Unicode to write XML documents in Windows NT). Most often, this means that XML documents are written in plain ASCII, or in UTF-8, a variable-length encoding of Unicode that stores each ASCII character in a single 8-bit byte and uses two or more bytes for every other character. (This is well suited to documents that are mostly ASCII: only the non-ASCII symbols need multiple bytes, whereas an ASCII document converted to 2-byte Unicode is twice as long.) Here's how to specify the UTF-8 character encoding in an XML document:

<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
    <GREETING>
        Hello From XML
    </GREETING>
    <MESSAGE>
        Welcome to the wild and woolly world of XML.
    </MESSAGE>
</DOCUMENT>

XML processors assume UTF-8 by default, so if you omit the encoding specification, your document is treated as UTF-8. If you're writing XML documents in ASCII, you'll have no trouble, because every ASCII document is also a valid UTF-8 document.
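The default can be tried out with a short Python sketch. It assumes the standard library's xml.etree.ElementTree parser as the XML processor, and it uses a cut-down version of the document above with one non-ASCII character added. It parses the same bytes with and without the encoding specification, and also checks that ASCII text produces identical bytes in ASCII and in UTF-8.

```python
import xml.etree.ElementTree as ET

# A small document containing one non-ASCII character (U+03C0),
# stored as UTF-8 bytes with no encoding specification at all.
body = "<DOCUMENT><GREETING>Hello From XML, \u03c0</GREETING></DOCUMENT>"
without_decl = body.encode("utf-8")

# The same bytes with an explicit UTF-8 declaration in front.
with_decl = b'<?xml version="1.0" encoding="UTF-8"?>' + without_decl

text_without = ET.fromstring(without_decl).find("GREETING").text
text_with = ET.fromstring(with_decl).find("GREETING").text
same_result = text_without == text_with

# Pure ASCII text is byte-for-byte the same in ASCII and in UTF-8.
ascii_only = "Hello From XML"
same_bytes = ascii_only.encode("ascii") == ascii_only.encode("utf-8")

print(same_result, same_bytes)  # prints: True True
```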

Actually, not even 2-byte Unicode has enough space for all symbols in common use. A broader specification, the Universal Character Set (UCS, also called ISO 10646), uses up to 4 bytes per symbol, which gives it a range of two billion symbols, far more than needed. You can specify that you want to use pure 2-byte encoding in your XML documents by using the UCS-2 encoding (also called ISO-10646-UCS-2), which is the 2-byte form of UCS. You also can use UTF-16, which represents the most common UCS symbols using 2 bytes, so that the result corresponds to UCS-2, and represents the remaining symbols as pairs of 2-byte codes (surrogate pairs). Straight 4-byte UCS encoding is referred to as UCS-4 (also called ISO-10646-UCS-4).
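To make the size differences concrete, the following Python sketch counts the bytes that three sample characters need in each encoding. It rests on two assumptions: Python's utf-32-be codec stands in for UCS-4 here, and the helper name byte_lengths is purely illustrative. The third sample, U+20000, is a symbol that lies beyond the 2-byte range.

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")  # utf-32-be stands in for UCS-4

def byte_lengths(ch):
    """Return the number of bytes one character needs in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}

samples = {
    "ASCII letter": "A",
    "Greek pi": "\u03c0",
    "beyond 2 bytes": "\U00020000",
}

for label, ch in samples.items():
    print(f"U+{ord(ch):04X} ({label}):", byte_lengths(ch))
```

The ASCII letter takes 1, 2, and 4 bytes respectively, pi takes 2, 2, and 4, and the character beyond the 2-byte range takes 4 bytes in all three encodings.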

I'll stick to ASCII for most XML documents in this book because support for Unicode and UCS is not yet widespread. For example, I know of no true Unicode editors. On the other hand, you can write documents in a local character set and use a translation utility to convert them to Unicode, or you can insert the actual Unicode codes directly into your documents. For example, the Unicode code for π is 03C0 in hexadecimal, so you can insert π into your document with the character reference &#x03C0; (more on entities in the next chapter).
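Here is a small Python sketch of that last technique, again assuming the standard library's xml.etree.ElementTree as the XML processor. The document itself contains only ASCII bytes, yet the parsed text contains the character whose Unicode code is 03C0. The MESSAGE content is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# An all-ASCII document that names a non-ASCII character by its code.
doc = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<MESSAGE>The symbol &#x03C0; is pi.</MESSAGE>"
)

message = ET.fromstring(doc).text

# The parser has replaced the reference with the character itself.
has_pi = "\u03c0" in message

print(has_pi)  # prints: True
```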

More character sets are available than those mentioned here; for a longer list, take a look at the list posted by the Internet Assigned Numbers Authority (IANA) at http://www.isi.edu/in-notes/iana/assignments/character-sets.

Converting ASCII to Unicode

If you want to convert files from a local character set to Unicode codes, you can use the native2ascii program that comes with Sun Microsystems' Java Software Development Kit (the SDK, formerly the JDK). The tool reads a file in a native encoding and writes ASCII text in which each non-ASCII character is replaced by a \uXXXX Unicode escape, like this: native2ascii file.txt file.uni. With the -encoding and -reverse options, you also can convert to a number of other encodings, such as UTF-8.
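For readers without the Java tools at hand, the general idea of re-encoding a file can be sketched in Python. This is not native2ascii itself. The helper name convert_encoding is hypothetical, the file names simply echo the command above, and the choice of UTF-16 as the output encoding is an assumption made for the illustration.

```python
import tempfile
from pathlib import Path

def convert_encoding(src, dst, src_encoding="ascii", dst_encoding="utf-16"):
    """Re-encode a text file (hypothetical helper, not part of any SDK)."""
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding=dst_encoding)

# Try it on a throwaway ASCII file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "file.txt"
    dst = Path(tmp) / "file.uni"
    src.write_text("Hello From XML", encoding="ascii")

    convert_encoding(src, dst)

    src_size = src.stat().st_size
    dst_size = dst.stat().st_size
    round_trip = dst.read_text(encoding="utf-16")

print(src_size, dst_size, round_trip)  # prints: 14 30 Hello From XML
```

The output file is a little more than twice the size of the input: 2 bytes per character, plus a 2-byte byte-order mark that Python's utf-16 codec writes at the start.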

