Character set declarations

The encoding parameter of the XML declaration tells an XML processor which character encoding a document uses. Valid names are those registered with IANA, listed at www.isi.edu/in-notes/iana/assignments/character-sets:

<?xml ... encoding="ISO-8859-1" ... ?>

<?xml ... encoding="UTF-8" ... ?>

<?xml ... encoding="Shift_JIS" ... ?>

<?xml ... encoding="ISO-2022-JP" ... ?>

Defaults

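The encoding parameter is not required. If the first four bytes of the file have the values '3C 3F 78 6D' (in hexadecimal notation), which represent the characters '<?xm', the document is assumed to be encoded in UTF-8 unless the XML declaration that follows names another encoding. If these characters are not present, the document has no XML declaration, and UTF-8 is still assumed unless the file begins with a UTF-16 byte order mark.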

If UTF-16 is in use, the characters expected at the start of an XML document could be represented by the values '00 3C 00 3F 00 78 00 6D', as might be expected, or equally by the byte-reversed values '3C 00 3F 00 78 00 6D 00'. To resolve this low-byte/high-byte ordering problem, which arises from differences in microprocessor design, the first two bytes of any UTF-16 file must be reserved for use as a byte order mark. The byte order mark should be 'FE FF' for the first ordering shown above, and 'FF FE' for the latter.
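
The detection rules described so far can be sketched in a few lines of Python. This is a minimal illustration, not a complete implementation: the function name sniff_xml_encoding is hypothetical, only UTF-8 and UTF-16 are considered, and the result is a first guess that an encoding parameter in the XML declaration may still override.

```python
import codecs


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding family of an XML document from its first bytes.

    Hypothetical helper: handles only the UTF-8 and UTF-16 cases.
    """
    # A byte order mark, when present, settles the question.
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "UTF-16BE"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "UTF-16LE"
    # No byte order mark: look for '<?' in each 16-bit byte ordering.
    if data.startswith(b"\x00\x3c\x00\x3f"):
        return "UTF-16BE"
    if data.startswith(b"\x3c\x00\x3f\x00"):
        return "UTF-16LE"
    # '3C 3F 78 6D' or anything else: assume UTF-8 for now.
    return "UTF-8"
```

Calling sniff_xml_encoding on the first few bytes of a file is enough; the rest of the document does not need to be read to make this first guess.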

Explicit declarations

A declaration is required for encodings other than UTF-8 and UTF-16. A document that conforms to ISO 646 can 'pretend' to be a UTF-8 file, because its 7-bit characters form a valid subset of UTF-8, but an ISO 8859 document must be explicitly identified:

<?xml ... encoding="ISO-8859-1" ... ?>

Note that a variant of this declaration, the Encoding Processing Instruction (called a text declaration in the XML standard), should appear at the top of each external entity whose character encoding differs from that of the main file:

<?xml encoding="EUC-JP" ?>
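
As a rough sketch of how software might read the declared name, the Python fragment below extracts the encoding parameter from the start of a document or external entity. The function name declared_encoding and the regular expression are illustrative assumptions, and the approach only works when the declaration itself is stored in an ASCII-compatible encoding (which covers the ISO 8859, Shift_JIS and EUC-JP examples shown in this section).

```python
import re

# Matches '<?xml ... encoding="NAME"' at the very start of the data.
# Group 1 is the quote character, group 2 the encoding name.
_DECL = re.compile(
    rb"""^<\?xml\s[^>]*?\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1"""
)


def declared_encoding(data: bytes, default: str = "UTF-8") -> str:
    """Return the declared encoding name, or the default if none is declared."""
    match = _DECL.match(data)
    if match is None:
        return default  # no declaration: fall back to UTF-8
    return match.group(2).decode("ascii")
```

For example, declared_encoding(b'<?xml encoding="EUC-JP" ?>') returns the name as written in the declaration, and a document with no declaration falls back to the UTF-8 default.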

MIME types

If an XML document is delivered over the Internet, the encoding scheme in use may be determined from the MIME header. When the MIME type is 'text/xml', the character encoding is provided in the 'charset' parameter of the Content-Type header.
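
A receiving program could pull the 'charset' parameter out of the header value along the following lines. This is a sketch using Python's standard email package; the function name charset_from_content_type is hypothetical, and the header value is assumed to have been obtained already from whatever transport delivered the document.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, lowercased.

    Returns None when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()
```

Given the value 'text/xml; charset=ISO-8859-1', the function returns 'iso-8859-1'; given plain 'text/xml', it returns None, and the processor would have to rely on the document itself.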

XHTML declarations

In XHTML, the Meta element is used to specify character sets:

<meta http-equiv="content-type" content="text/html; charset=EUC-JP" />
