An XML processor is told which character encoding an XML document uses by the encoding parameter of the XML declaration. Valid encoding names are listed at www.isi.edu/in-notes/iana/assignments/character-sets. For example:
<?xml ... encoding="ISO-8859-1" ... ?>
<?xml ... encoding="UTF-8" ... ?>
<?xml ... encoding="Shift_JIS" ... ?>
<?xml ... encoding="ISO-2022-JP" ... ?>
The encoding parameter is not required. A document that omits it is assumed to be encoded in UTF-8. The first four bytes of such a file normally have the values '3C 3F 78 6D' (in hexadecimal notation), which represent the characters '<?xm'. If these characters are not present, because the XML declaration itself has been omitted, UTF-8 is still assumed unless the file begins with a UTF-16 byte order mark.
If UTF-16 is in use, the characters expected at the start of an XML document could be represented by the values '00 3C 00 3F 00 78 00 6D', as might be expected, or equally by the byte-reversed values '3C 00 3F 00 78 00 6D 00'. This low-byte/high-byte ordering problem arises because different microprocessor designs store multi-byte values in different orders. To resolve it, the first two bytes of any UTF-16 file must be reserved for use as a byte order mark. The byte order mark should be 'FE FF' for the first ordering shown above, and 'FF FE' for the second.
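The detection rules described so far can be sketched in a few lines of Python. This is only an illustration, not how any particular XML processor is implemented: the function name sniff_xml_encoding is hypothetical, it uses the standard byte order mark values 'FE FF' and 'FF FE', and it covers only the UTF-8 and UTF-16 cases discussed in the text.

```python
def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding family of an XML document from its leading bytes.

    Hypothetical helper: handles only the UTF-8 and UTF-16 cases.
    """
    # A byte order mark, when present, settles the question.
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"  # high byte first: 00 3C 00 3F ...
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"  # low byte first: 3C 00 3F 00 ...

    # No byte order mark: look for '<?xm' stored as two-byte characters.
    if data.startswith(b"\x00<\x00?\x00x\x00m"):
        return "UTF-16BE"
    if data.startswith(b"<\x00?\x00x\x00m\x00"):
        return "UTF-16LE"

    # '<?xm' as single bytes (3C 3F 78 6D), or anything else: assume UTF-8.
    return "UTF-8"


if __name__ == "__main__":
    print(sniff_xml_encoding(b"<?xml version='1.0'?><doc/>"))
    print(sniff_xml_encoding(b"\xfe\xff\x00<\x00d\x00o\x00c"))
```

A real processor would go on to read the encoding parameter, if any, once the encoding family is known well enough to decode the declaration.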
An encoding declaration is required for encodings other than UTF-8 and UTF-16. A document that conforms to ISO 646 can 'pretend' to be a UTF-8 file, because ISO 646 is a valid subset of UTF-8, but an ISO 8859 document must be explicitly identified:
<?xml ... encoding="ISO-8859-1" ... ?>
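The difference between the two cases can be checked directly. The following Python sketch (the helper name is_valid_utf8 and the sample documents are invented for illustration) shows that ASCII bytes, standing in here for ISO 646, decode cleanly as UTF-8, whereas ISO 8859-1 bytes containing an accented character do not:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Sample documents invented for this illustration.
ascii_doc = "<doc>plain text</doc>".encode("ascii")
latin1_doc = "<doc>caf\u00e9</doc>".encode("iso-8859-1")  # e-acute becomes byte E9

if __name__ == "__main__":
    print(is_valid_utf8(ascii_doc))   # ASCII is a subset of UTF-8
    print(is_valid_utf8(latin1_doc))  # a lone E9 byte is not valid UTF-8
```

An ISO 8859-1 document that happens to contain only ASCII characters would also pass this check, which is why the declaration matters only once characters outside that range appear.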
Note that a variant of this declaration, the Encoding Processing Instruction (called the text declaration in the XML 1.0 recommendation), should appear at the top of each external entity, where it is used to identify entities that have a different character encoding from the main file:
<?xml encoding="EUC-JP" ?>
If an XML document is delivered over the Internet, the encoding scheme in use may be determined from the MIME header. When the MIME type is 'text/xml', the character encoding is given by the 'charset' parameter of the Content-Type header.
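As a rough illustration, the 'charset' parameter can be extracted from a Content-Type header value with Python's standard email package. The function name charset_from_content_type is hypothetical, and the sketch assumes the header value has already been obtained from the transport layer:

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper; the charset name is returned in lowercase.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


if __name__ == "__main__":
    print(charset_from_content_type("text/xml; charset=EUC-JP"))
    print(charset_from_content_type("text/xml"))
```

When no 'charset' parameter is present the function returns None, and the caller has to fall back on some other rule.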
In XHTML, the meta element is used to specify the character encoding:
<meta http-equiv="content-type" content="text/html; charset=EUC-JP" />