Extended ASCII

With the greater reliability of modern computer systems and networks, the eighth bit of each byte no longer has to serve as a check-bit, so an 8-bit character set is now safe to use. Using the extra bit doubles the number of values available from 128 to 256, giving an extra 128 character values.

There was originally no agreement between system vendors on which extra characters to include, and which values to assign to those characters that they happen to have in common. For example, the Apple Macintosh OS uses the value 142 to represent the character 'é', whereas the same character is given a value of 130 in one configuration of an MS-DOS system. Therefore, a Macintosh text file should not be copied to an MS-DOS system without passing it through a filter that moves the extra characters to their correct positions in the latter's character set. The problem that ASCII was originally intended to resolve thus returned.
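The mismatch and the filter described above can be sketched in Python. This is only an illustration: it assumes that Python's `mac_roman` and `cp437` codecs correspond to the Macintosh and MS-DOS configurations mentioned, and `mac_to_dos` is a hypothetical helper name.

```python
# Assumption: these two Python codecs stand in for the Macintosh and
# MS-DOS character sets discussed in the text.
MAC = "mac_roman"
DOS = "cp437"

mac_value = "é".encode(MAC)[0]   # 142
dos_value = "é".encode(DOS)[0]   # 130


def mac_to_dos(data: bytes) -> bytes:
    """Hypothetical filter: move Macintosh bytes to their MS-DOS positions."""
    # errors="replace" writes '?' for characters the target set does not have
    return data.decode(MAC).encode(DOS, errors="replace")


mac_file = "café".encode(MAC)                # bytes as stored on a Macintosh
unfiltered = mac_file.decode(DOS)            # read on MS-DOS without filtering
filtered = mac_to_dos(mac_file).decode(DOS)  # read on MS-DOS after filtering

print(mac_value, dos_value)
print(unfiltered, filtered)
```

Without the filter, the byte 142 is read in the MS-DOS set as a different letter, which is exactly the kind of corruption the paragraph above warns about.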

ISO 8859

The problem of incompatible 8-bit encodings was addressed by ISO 8859, which defines (amongst others) a standard Latin alphabet character set (ISO 8859/1) occupying the 256 values available in one byte, currently used by Web browsers, Microsoft Windows and SunOS UNIX (a chart of this character set is shown in Chapter 33).

These sets were designed in the 1980s by ECMA. Backward compatibility with the 7-bit ISO 646 standard is maintained, so if the high bit of every byte in an 8859 document is zero, it automatically counts as a 646 document as well. Conversely, every 646 document is automatically an 8859/1 document.
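The high-bit test can be sketched as follows. The sketch assumes that Python's `ascii` codec is an acceptable stand-in for ISO 646 and `iso8859_1` for ISO 8859/1. The function names are hypothetical.

```python
def high_bits_clear(data: bytes) -> bool:
    """True if the high (eighth) bit of every byte is zero."""
    return all(byte & 0x80 == 0 for byte in data)


def is_valid_646(data: bytes) -> bool:
    """True if the bytes decode as 7-bit text (assumed: Python's 'ascii')."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


plain = b"<title>Hello</title>"            # 7-bit characters only
accented = "café".encode("iso8859_1")      # contains one byte above 127

for sample in (plain, accented):
    print(sample, high_bits_clear(sample), is_valid_646(sample))

# A 7-bit document reads identically under both character sets
same_text = plain.decode("ascii") == plain.decode("iso8859_1")
print(same_text)
```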



The '/1' part of the name identifies just one variant of 8859 (the standard Latin alphabet). There are other variants that include different accented characters, to cover the requirements of various languages:

ISO 8859/x Languages covered
1 ISO Latin-1. ASCII characters plus Danish, Dutch, English, Faroese, Finnish, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish characters.
2 ISO Latin-2. ASCII characters plus Croatian, Czech, Hungarian, Polish, Romanian, Slovak and Slovenian characters.
3 ISO Latin-3. ASCII characters plus Esperanto, Maltese, Turkish (though 8859/9 is now preferred for this language) and Galician characters.
4 ISO Latin-4. ASCII characters plus Latvian, Lithuanian, Greenlandic and Lappish.
5 ASCII characters plus Cyrillic characters to cover Byelorussian, Bulgarian, Macedonian, Russian, Serbian and Ukrainian.
6 ASCII characters plus Arabic.
7 ASCII characters plus modern Greek.
8 ASCII characters plus Hebrew.
9 ISO Latin-5. As Latin-1 except six Turkish characters replace six Icelandic letters.
10 ISO Latin-6. ASCII characters plus Lappish, Nordic and Inuit.
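The relationship between the variants can be explored with a short sketch. It assumes that Python's `iso8859_N` codecs match the corresponding parts of the standard. It lists the positions where Latin-5 (8859/9) departs from Latin-1, and shows one byte value read under three different variants.

```python
# The upper half of the byte range, where the 8859 variants differ
upper_half = bytes(range(0xA0, 0x100))

latin1 = upper_half.decode("iso8859_1")
latin5 = upper_half.decode("iso8859_9")

# Collect (byte value, Latin-1 character, Latin-5 character) where they differ
differences = [
    (0xA0 + offset, a, b)
    for offset, (a, b) in enumerate(zip(latin1, latin5))
    if a != b
]
for value, in_latin1, in_latin5 in differences:
    print(value, in_latin1, in_latin5)

# One byte value, three readings depending on the variant assumed
same_byte = bytes([0xE9])
readings = {
    name: same_byte.decode(name)
    for name in ("iso8859_1", "iso8859_5", "iso8859_7")
}
print(readings)
```

Because nothing in the bytes themselves says which variant applies, a document has to declare its variant by some other means.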

Despite the existence of the more powerful standards described later, work still continues on yet more variants of 8859.

Japanese

Probably the most significant non-Roman script in use today on the Web is Japanese. There have been several attempts to define encodings for this language, including ISO-2022-JP (which includes ASCII as a subset, so XML tags can still look the same), EUC-JP, Shift_JIS and JIS X 0201-1997.

On UNIX platforms, EUC-JP has been dominant, but under Microsoft Windows Shift_JIS is more popular.
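As a rough illustration, the sketch below encodes the same text under three Japanese encodings. It assumes that Python's `iso2022_jp`, `euc_jp` and `shift_jis` codecs are faithful implementations. The sample tag and character are arbitrary choices.

```python
ENCODINGS = ("iso2022_jp", "euc_jp", "shift_jis")

tag = "<title>"   # markup built from ASCII characters only
word = "あ"        # a single hiragana character

encoded_tags = {name: tag.encode(name) for name in ENCODINGS}
encoded_words = {name: word.encode(name) for name in ENCODINGS}

for name in ENCODINGS:
    # The tag bytes are the same each time; the Japanese bytes are not
    print(name, encoded_tags[name], encoded_words[name].hex())
```

The ASCII-only tag comes out as the same bytes in all three cases, while the Japanese character takes a different multi-byte form in each. A file therefore cannot be read correctly without knowing which encoding produced it.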

Limitations

Even 8-bit character sets are very limited in scope, as they can only directly handle 256 characters. The obvious solution is to use notations that use more than one byte to hold each character.
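The limitation can be demonstrated with a small sketch. It assumes Python's `iso8859_1` to `iso8859_10` codecs for the ten variants listed earlier, and uses UTF-16 purely as one example of a notation with more than one byte per character. UTF-16 is an assumption introduced here, not something named in this section.

```python
# One Latin, one Cyrillic and one Greek character in the same text
text = "éщι"

PARTS = [f"iso8859_{n}" for n in range(1, 11)]


def can_encode(sample: str, encoding: str) -> bool:
    """True if every character of sample exists in the given character set."""
    try:
        sample.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# No single 8-bit variant holds all three characters
usable = [part for part in PARTS if can_encode(text, part)]
print(usable)

# Assumed example of a multi-byte notation: two bytes per character here
two_bytes_each = text.encode("utf-16-be")
print(len(text), len(two_bytes_each))
```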
