Appendices

A. References

A.1. Normative References

IANA

IANA (Internet Assigned Numbers Authority). Official Names for Character Sets, ed. Keld Simonsen et al. See ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets.

IETF RFC 1766

IETF (Internet Engineering Task Force). RFC 1766: Tags for the Identification of Languages, ed. H. Alvestrand. 1995.

ISO 639

ISO (International Organization for Standardization). ISO 639:1988 (E). Code for the representation of names of languages. [Geneva]: International Organization for Standardization, 1988.

ISO 3166

ISO (International Organization for Standardization). ISO 3166-1:1997 (E). Codes for the representation of names of countries and their subdivisions — Part 1: Country codes. [Geneva]: International Organization for Standardization, 1997.

ISO/IEC 10646

ISO (International Organization for Standardization). ISO/IEC 10646-1993 (E). Information technology — Universal Multiple-Octet Coded Character Set (UCS) — Part 1: Architecture and Basic Multilingual Plane. [Geneva]: International Organization for Standardization, 1993 (plus amendments AM 1 through AM 7).

Unicode

The Unicode Consortium. The Unicode Standard, Version 2.0. Reading, Mass.: Addison-Wesley Developers Press, 1996.

A.2. Other References

Aho/Ullman

Aho, Alfred V., Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Reading: Addison-Wesley, 1986, rpt. corr. 1988.

Berners-Lee et al.

Berners-Lee, T., R. Fielding, and L. Masinter. Uniform Resource Identifiers (URI): Generic Syntax and Semantics. 1997. (Work in progress; see updates to RFC1738.)

Brüggemann-Klein

Brüggemann-Klein, Anne. Regular Expressions into Finite Automata. Extended abstract in I. Simon, ed., LATIN 1992, pp. 97-98. Springer-Verlag, Berlin 1992. Full version in Theoretical Computer Science 120: 197–213, 1993.

Brüggemann-Klein and Wood

Brüggemann-Klein, Anne, and Derick Wood. Deterministic Regular Languages. Universität Freiburg, Institut für Informatik, Bericht 38, Oktober 1991.

Clark

James Clark. Comparison of SGML and XML. See http://www.w3.org/TR/NOTE-sgml-xml-971215.

IETF RFC1738

IETF (Internet Engineering Task Force). RFC 1738: Uniform Resource Locators (URL), ed. T. Berners-Lee, L. Masinter, M. McCahill. 1994.

IETF RFC1808

IETF (Internet Engineering Task Force). RFC 1808: Relative Uniform Resource Locators, ed. R. Fielding. 1995.

IETF RFC2141

IETF (Internet Engineering Task Force). RFC 2141: URN Syntax, ed. R. Moats. 1997.

ISO 8879

ISO (International Organization for Standardization). ISO 8879:1986(E). Information processing — Text and Office Systems — Standard Generalized Markup Language (SGML). First edition — 1986-10-15. [Geneva]: International Organization for Standardization, 1986.

ISO/IEC 10744

ISO (International Organization for Standardization). ISO/IEC 10744-1992 (E). Information technology — Hypermedia/Time-based Structuring Language (HyTime). [Geneva]: International Organization for Standardization, 1992. Extended Facilities Annexe. [Geneva]: International Organization for Standardization, 1996.

B. Character Classes

Following the characteristics defined in the Unicode standard, characters are classed as base characters (among others, these contain the alphabetic characters of the Latin alphabet, without diacritics), ideographic characters, and combining characters (among others, this class contains most diacritics); these classes combine to form the class of letters. Digits and extenders are also distinguished.

Characters

[84]         Letter ::=  BaseChar | Ideographic
[85]       BaseChar ::=  [#x0041-#x005A] | [#x0061-#x007A] | [#x00C0-#x00D6]
                         | [#x00D8-#x00F6]
                         | [#x00F8-#x00FF] | [#x0100-#x0131] | [#x0134-#x013E]
                         | [#x0141-#x0148] | [#x014A-#x017E] | [#x0180-#x01C3]
                         | [#x01CD-#x01F0] | [#x01F4-#x01F5] | [#x01FA-#x0217]
                         | [#x0250-#x02A8] | [#x02BB-#x02C1] | #x0386
                         | [#x0388-#x038A]
                         | #x038C | [#x038E-#x03A1] | [#x03A3-#x03CE]
                         | [#x03D0-#x03D6]
                         | #x03DA | #x03DC | #x03DE | #x03E0 | [#x03E2-#x03F3]
                         | [#x0401-#x040C] | [#x040E-#x044F] | [#x0451-#x045C]
                         | [#x045E-#x0481] | [#x0490-#x04C4] | [#x04C7-#x04C8]
                         | [#x04CB-#x04CC] | [#x04D0-#x04EB] | [#x04EE-#x04F5]
                         | [#x04F8-#x04F9] | [#x0531-#x0556] | #x0559
                         | [#x0561-#x0586] 
                         | [#x05D0-#x05EA] | [#x05F0-#x05F2] | [#x0621-#x063A]
                         | [#x0641-#x064A] | [#x0671-#x06B7] | [#x06BA-#x06BE]
                         | [#x06C0-#x06CE] | [#x06D0-#x06D3] | #x06D5
                         | [#x06E5-#x06E6]
                         | [#x0905-#x0939] | #x093D | [#x0958-#x0961]
                         | [#x0985-#x098C]
                         | [#x098F-#x0990] | [#x0993-#x09A8] | [#x09AA-#x09B0]
                         | #x09B2
                         | [#x09B6-#x09B9] | [#x09DC-#x09DD] | [#x09DF-#x09E1]
                         | [#x09F0-#x09F1] | [#x0A05-#x0A0A] | [#x0A0F-#x0A10]
                         | [#x0A13-#x0A28] | [#x0A2A-#x0A30] | [#x0A32-#x0A33]
                         | [#x0A35-#x0A36] | [#x0A38-#x0A39] | [#x0A59-#x0A5C]
                         | #x0A5E
                         | [#x0A72-#x0A74] | [#x0A85-#x0A8B] | #x0A8D
                         | [#x0A8F-#x0A91]
                         | [#x0A93-#x0AA8] | [#x0AAA-#x0AB0] | [#x0AB2-#x0AB3]
                         | [#x0AB5-#x0AB9] | #x0ABD | #x0AE0 | [#x0B05-#x0B0C]
                         | [#x0B0F-#x0B10] | [#x0B13-#x0B28] | [#x0B2A-#x0B30]
                         | [#x0B32-#x0B33] | [#x0B36-#x0B39] | #x0B3D
                         | [#x0B5C-#x0B5D]
                         | [#x0B5F-#x0B61] | [#x0B85-#x0B8A] | [#x0B8E-#x0B90]
                         | [#x0B92-#x0B95] | [#x0B99-#x0B9A] | #x0B9C
                         | [#x0B9E-#x0B9F] 
                         | [#x0BA3-#x0BA4] | [#x0BA8-#x0BAA] | [#x0BAE-#x0BB5]
                         | [#x0BB7-#x0BB9] | [#x0C05-#x0C0C] | [#x0C0E-#x0C10]
                         | [#x0C12-#x0C28] | [#x0C2A-#x0C33] | [#x0C35-#x0C39]
                         | [#x0C60-#x0C61] | [#x0C85-#x0C8C] | [#x0C8E-#x0C90]
                         | [#x0C92-#x0CA8] | [#x0CAA-#x0CB3] | [#x0CB5-#x0CB9]
                         | #x0CDE
                         | [#x0CE0-#x0CE1] | [#x0D05-#x0D0C] | [#x0D0E-#x0D10]
                         | [#x0D12-#x0D28] | [#x0D2A-#x0D39] | [#x0D60-#x0D61]
                         | [#x0E01-#x0E2E] | #x0E30 | [#x0E32-#x0E33]
                         | [#x0E40-#x0E45]
                         | [#x0E81-#x0E82] | #x0E84 | [#x0E87-#x0E88]
                         | #x0E8A | #x0E8D
                         | [#x0E94-#x0E97] | [#x0E99-#x0E9F] | [#x0EA1-#x0EA3]
                         | #x0EA5
                         | #x0EA7 | [#x0EAA-#x0EAB] | [#x0EAD-#x0EAE] | #x0EB0
                         | [#x0EB2-#x0EB3] | #x0EBD | [#x0EC0-#x0EC4]
                         | [#x0F40-#x0F47] 
                         | [#x0F49-#x0F69] | [#x10A0-#x10C5] | [#x10D0-#x10F6]
                         | #x1100
                         | [#x1102-#x1103] | [#x1105-#x1107] | #x1109
                         | [#x110B-#x110C]
                         | [#x110E-#x1112] | #x113C | #x113E | #x1140 | #x114C
                         | #x114E
                         | #x1150 | [#x1154-#x1155] | #x1159 | [#x115F-#x1161]
                         | #x1163
                         | #x1165 | #x1167 | #x1169 | [#x116D-#x116E]
                         | [#x1172-#x1173]
                         | #x1175 | #x119E | #x11A8 | #x11AB | [#x11AE-#x11AF]
                         | [#x11B7-#x11B8] | #x11BA | [#x11BC-#x11C2] | #x11EB
                         | #x11F0
                         | #x11F9 | [#x1E00-#x1E9B] | [#x1EA0-#x1EF9]
                         | [#x1F00-#x1F15] 
                         | [#x1F18-#x1F1D] | [#x1F20-#x1F45] | [#x1F48-#x1F4D]
                         | [#x1F50-#x1F57] | #x1F59 | #x1F5B | #x1F5D
                         | [#x1F5F-#x1F7D]
                         | [#x1F80-#x1FB4] | [#x1FB6-#x1FBC] | #x1FBE
                         | [#x1FC2-#x1FC4]
                         | [#x1FC6-#x1FCC] | [#x1FD0-#x1FD3] | [#x1FD6-#x1FDB]
                         | [#x1FE0-#x1FEC] | [#x1FF2-#x1FF4] | [#x1FF6-#x1FFC]
                         | #x2126
                         | [#x212A-#x212B] | #x212E | [#x2180-#x2182]
                         | [#x3041-#x3094]
                         | [#x30A1-#x30FA] | [#x3105-#x312C] | [#xAC00-#xD7A3]
[86]    Ideographic ::=  [#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029]
[87]  CombiningChar ::=  [#x0300-#x0345] | [#x0360-#x0361] | [#x0483-#x0486]
                         | [#x0591-#x05A1]
                         | [#x05A3-#x05B9] | [#x05BB-#x05BD] | #x05BF
                         | [#x05C1-#x05C2]
                         | #x05C4 | [#x064B-#x0652] | #x0670 | [#x06D6-#x06DC]
                         | [#x06DD-#x06DF] | [#x06E0-#x06E4] | [#x06E7-#x06E8]
                         | [#x06EA-#x06ED] | [#x0901-#x0903] | #x093C
                         | [#x093E-#x094C] 
                         | #x094D | [#x0951-#x0954] | [#x0962-#x0963]
                         | [#x0981-#x0983]
                         | #x09BC | #x09BE | #x09BF | [#x09C0-#x09C4]
                         | [#x09C7-#x09C8]
                         | [#x09CB-#x09CD] | #x09D7 | [#x09E2-#x09E3] | #x0A02
                         | #x0A3C
                         | #x0A3E | #x0A3F | [#x0A40-#x0A42] | [#x0A47-#x0A48]
                         | [#x0A4B-#x0A4D] | [#x0A70-#x0A71] | [#x0A81-#x0A83]
                         | #x0ABC
                         | [#x0ABE-#x0AC5] | [#x0AC7-#x0AC9] | [#x0ACB-#x0ACD]
                         | [#x0B01-#x0B03] | #x0B3C | [#x0B3E-#x0B43]
                         | [#x0B47-#x0B48]
                         | [#x0B4B-#x0B4D] | [#x0B56-#x0B57] | [#x0B82-#x0B83]
                         | [#x0BBE-#x0BC2] | [#x0BC6-#x0BC8] | [#x0BCA-#x0BCD]
                         | #x0BD7
                         | [#x0C01-#x0C03] | [#x0C3E-#x0C44] | [#x0C46-#x0C48]
                         | [#x0C4A-#x0C4D] | [#x0C55-#x0C56] | [#x0C82-#x0C83]
                         | [#x0CBE-#x0CC4] | [#x0CC6-#x0CC8] | [#x0CCA-#x0CCD]
                         | [#x0CD5-#x0CD6] | [#x0D02-#x0D03] | [#x0D3E-#x0D43]
                         | [#x0D46-#x0D48] | [#x0D4A-#x0D4D] | #x0D57 | #x0E31
                         | [#x0E34-#x0E3A] | [#x0E47-#x0E4E] | #x0EB1
                         | [#x0EB4-#x0EB9] 
                         | [#x0EBB-#x0EBC] | [#x0EC8-#x0ECD] | [#x0F18-#x0F19]
                         | #x0F35
                         | #x0F37 | #x0F39 | #x0F3E | #x0F3F | [#x0F71-#x0F84]
                         | [#x0F86-#x0F8B] | [#x0F90-#x0F95] | #x0F97
                         | [#x0F99-#x0FAD]
                         | [#x0FB1-#x0FB7] | #x0FB9 | [#x20D0-#x20DC] | #x20E1
                         | [#x302A-#x302F] | #x3099 | #x309A
[88]          Digit ::=  [#x0030-#x0039] | [#x0660-#x0669] | [#x06F0-#x06F9]
                         | [#x0966-#x096F]
                         | [#x09E6-#x09EF] | [#x0A66-#x0A6F] | [#x0AE6-#x0AEF]
                         | [#x0B66-#x0B6F] | [#x0BE7-#x0BEF] | [#x0C66-#x0C6F]
                         | [#x0CE6-#x0CEF] | [#x0D66-#x0D6F] | [#x0E50-#x0E59]
                         | [#x0ED0-#x0ED9] | [#x0F20-#x0F29]
[89]       Extender ::=  #x00B7 | #x02D0 | #x02D1 | #x0387 | #x0640 | #x0E46
                         | #x0EC6 | #x3005
                         | [#x3031-#x3035] | [#x309D-#x309E] | [#x30FC-#x30FE]

The character classes defined here can be derived from the Unicode character database as follows:

  • Name start characters must have one of the categories Ll, Lu, Lo, Lt, Nl.

  • Name characters other than Name-start characters must have one of the categories Mc, Me, Mn, Lm, or Nd.

  • Characters in the compatibility area (i.e. with character code greater than #xF900 and less than #xFFFE) are not allowed in XML names.

  • Characters which have a font or compatibility decomposition (i.e. those with a "compatibility formatting tag" in field 5 of the database — marked by field 5 beginning with a "<") are not allowed.

  • The following characters are treated as name-start characters rather than name characters, because the property file classifies them as Alphabetic: [#x02BB-#x02C1], #x0559, #x06E5, #x06E6.

  • Characters #x20DD-#x20E0 are excluded (in accordance with Unicode, section 5.14).

  • Character #x00B7 is classified as an extender, because the property list so identifies it.

  • Character #x0387 is added as a name character, because #x00B7 is its canonical equivalent.

  • Characters ':' and '_' are allowed as name-start characters.

  • Characters '-' and '.' are allowed as name characters.
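
As a rough illustration of how productions [88] and [89] above might be used in code, the following Python sketch transcribes the Digit and Extender ranges into tables and tests membership. The names DIGIT, EXTENDER, and in_class are hypothetical, chosen for this example only; the much longer BaseChar and CombiningChar classes could presumably be handled the same way.

```python
# Ranges transcribed from productions [88] Digit and [89] Extender.
# Single code points are written as (cp, cp) pairs.
DIGIT = [
    (0x0030, 0x0039), (0x0660, 0x0669), (0x06F0, 0x06F9), (0x0966, 0x096F),
    (0x09E6, 0x09EF), (0x0A66, 0x0A6F), (0x0AE6, 0x0AEF), (0x0B66, 0x0B6F),
    (0x0BE7, 0x0BEF), (0x0C66, 0x0C6F), (0x0CE6, 0x0CEF), (0x0D66, 0x0D6F),
    (0x0E50, 0x0E59), (0x0ED0, 0x0ED9), (0x0F20, 0x0F29),
]
EXTENDER = [
    (0x00B7, 0x00B7), (0x02D0, 0x02D0), (0x02D1, 0x02D1), (0x0387, 0x0387),
    (0x0640, 0x0640), (0x0E46, 0x0E46), (0x0EC6, 0x0EC6), (0x3005, 0x3005),
    (0x3031, 0x3035), (0x309D, 0x309E), (0x30FC, 0x30FE),
]


def in_class(ch, ranges):
    """Return True if the single character ch falls in any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


print(in_class("7", DIGIT))          # ASCII digit
print(in_class("\u00B7", EXTENDER))  # middle dot, classified as an extender
```

A linear scan is adequate for tables this small; for the full BaseChar table a sorted list with binary search might be a better fit.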

C. XML and SGML (Non-Normative)

XML is designed to be a subset of SGML, in that every valid XML document should also be a conformant SGML document. For a detailed comparison of the additional restrictions that XML places on documents beyond those of SGML, see [Clark].

D. Expansion of Entity and Character References (Non-Normative)

This appendix contains some examples illustrating the sequence of entity- and character-reference recognition and expansion, as specified in "4.4 XML Processor Treatment of Entities and References".

If the DTD contains the declaration

<!ENTITY example "<p>An ampersand (&#38;#38;) may be escaped
numerically (&#38;#38;#38;) or with a general entity
(&amp;amp;).</p>" >

then the XML processor will recognize the character references when it parses the entity declaration, and resolve them before storing the following string as the value of the entity "example":

<p>An ampersand (&#38;) may be escaped
numerically (&#38;#38;) or with a general entity
(&amp;amp;).</p>

A reference in the document to "&example;" will cause the text to be reparsed, at which time the start- and end-tags of the "p" element will be recognized and the three references will be recognized and expanded, resulting in a "p" element with the following content (all data, no delimiters or markup):

An ampersand (&) may be escaped numerically (&#38;) or with a general entity (&amp;).
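
The two stages above can be imitated with a short Python sketch. It is not an XML processor: it only models the fact that each pass replaces references once and does not rescan the text it has just produced. The function names are hypothetical, and the second pass treats "&amp;" as if it expanded directly to "&", a shortcut for the declared replacement text "&#38;" being parsed in turn.

```python
import re


def expand_char_refs(text):
    """Stage 1: expand decimal character references in a literal entity value.

    re.sub makes a single left-to-right pass and never rescans its own
    output, so "&#38;#38;" becomes "&#38;" rather than "&".
    General-entity references such as "&amp;" are left alone at this stage.
    """
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), text)


def expand_references(text):
    """Stage 2: expand references when the entity is referenced in content.

    Assumption for this sketch: only decimal character references and the
    predefined entity "amp" occur in the text.
    """
    def replace(match):
        return chr(int(match.group(1))) if match.group(1) else "&"
    return re.sub(r"&#([0-9]+);|&amp;", replace, text)


literal = ("<p>An ampersand (&#38;#38;) may be escaped\n"
           "numerically (&#38;#38;#38;) or with a general entity\n"
           "(&amp;amp;).</p>")

replacement_text = expand_char_refs(literal)     # stored value of "example"
content = expand_references(replacement_text)    # after "&example;" is reparsed
print(replacement_text)
print(content)
```

The sketch leaves the "p" tags in place; a real processor would recognize them as markup during the second pass.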

A more complex example will illustrate the rules and their effects fully. In the following example, the line numbers are solely for reference.

1 <?xml version='1.0'?>
2 <!DOCTYPE test [
3 <!ELEMENT test (#PCDATA) >
4 <!ENTITY % xx '&#37;zz;'>
5 <!ENTITY % zz '&#60;!ENTITY tricky "error-prone" >' >
6 %xx;
7 ]>
8 <test>This sample shows a &tricky; method.</test>

This produces the following:

  • in line 4, the reference to character 37 is expanded immediately, and the parameter entity "xx" is stored in the symbol table with the value "%zz;". Since the replacement text is not rescanned, the reference to parameter entity "zz" is not recognized. (And it would be an error if it were, since "zz" is not yet declared.)

  • in line 5, the character reference "&#60;" is expanded immediately and the parameter entity "zz" is stored with the replacement text "<!ENTITY tricky "error-prone" >", which is a well-formed entity declaration.

  • in line 6, the reference to "xx" is recognized, and the replacement text of "xx" (namely "%zz;") is parsed. The reference to "zz" is recognized in its turn, and its replacement text ("<!ENTITY tricky "error-prone" >") is parsed. The general entity "tricky" has now been declared, with the replacement text "error-prone".

  • in line 8, the reference to the general entity "tricky" is recognized, and it is expanded, so the full content of the "test" element is the self-describing (and ungrammatical) string "This sample shows a error-prone method."
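
The four steps can be traced with a small Python simulation. This is a sketch under simplifying assumptions, not a DTD parser: the symbol tables are plain dictionaries, the helper names are hypothetical, and details such as the spaces a processor adds around included parameter-entity text are ignored.

```python
import re


def expand_char_refs(value):
    """Expand decimal character references once, without rescanning."""
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), value)


def expand_pe_refs(text, table):
    """Replace each %name; with its replacement text, which is parsed in turn."""
    return re.sub(r"%([A-Za-z_:][\w.:-]*);",
                  lambda m: expand_pe_refs(table[m.group(1)], table), text)


param_entities = {}
# Line 4: "&#37;" becomes "%" immediately; "%zz;" is stored, not recognized.
param_entities["xx"] = expand_char_refs("&#37;zz;")
# Line 5: "&#60;" becomes "<" immediately.
param_entities["zz"] = expand_char_refs('&#60;!ENTITY tricky "error-prone" >')

# Line 6: "%xx;" is recognized, then "%zz;" inside its replacement text.
declaration = expand_pe_refs("%xx;", param_entities)

# The resulting text declares the general entity "tricky".
name, value = re.fullmatch(r'<!ENTITY\s+(\w+)\s+"([^"]*)"\s*>',
                           declaration).groups()
general_entities = {name: value}

# Line 8: the reference to "tricky" is expanded in the element content.
content = re.sub(r"&(\w+);", lambda m: general_entities[m.group(1)],
                 "This sample shows a &tricky; method.")
print(content)
```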

E. Deterministic Content Models (Non-Normative)

For compatibility, it is required that content models in element type declarations be deterministic.

SGML requires deterministic content models (it calls them "unambiguous"); XML processors built using SGML systems may flag non-deterministic content models as errors.

For example, the content model ((b, c) | (b, d)) is non-deterministic, because given an initial b the parser cannot know which b in the model is being matched without looking ahead to see which element follows the b. In this case, the two references to b can be collapsed into a single reference, making the model read (b, (c | d)). An initial b now clearly matches only a single name in the content model. The parser doesn't need to look ahead to see what follows; either c or d would be accepted.

More formally: a finite state automaton may be constructed from the content model using the standard algorithms, e.g. algorithm 3.5 in section 3.9 of Aho, Sethi, and Ullman [Aho/Ullman]. In many such algorithms, a follow set is constructed for each position in the regular expression (i.e., each leaf node in the syntax tree for the regular expression); if any position has a follow set in which more than one following position is labeled with the same element type name, then the content model is in error and may be reported as an error.
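
The follow-set test can be sketched in Python as below. This is one possible formulation in the style of a position automaton, not the specific algorithm cited above. The tuple representation of content models and the name is_deterministic are assumptions made for the example: a model is either an element type name (a string) or a tuple whose first item is "seq", "choice", "opt", "star", or "plus". The sketch also checks the set of possible first positions, which is what catches the ((b, c) | (b, d)) case.

```python
from itertools import count


def is_deterministic(model):
    """Return True if no first or follow set repeats an element type name."""
    labels = {}   # position id -> element type name
    follow = {}   # position id -> positions that may come next
    ids = count()

    def visit(node):
        # Returns (nullable, first positions, last positions).
        if isinstance(node, str):
            p = next(ids)
            labels[p], follow[p] = node, set()
            return False, {p}, {p}
        op, *args = node
        if op == "seq":
            nullable, first, last = True, set(), set()
            for child in args:
                c_null, c_first, c_last = visit(child)
                for p in last:              # prefix may be followed by child
                    follow[p] |= c_first
                if nullable:                # everything so far can be empty
                    first |= c_first
                last = c_last | (last if c_null else set())
                nullable = nullable and c_null
            return nullable, first, last
        if op == "choice":
            nullable, first, last = False, set(), set()
            for child in args:
                c_null, c_first, c_last = visit(child)
                nullable = nullable or c_null
                first |= c_first
                last |= c_last
            return nullable, first, last
        if op not in ("opt", "star", "plus"):
            raise ValueError("unknown operator: %r" % (op,))
        c_null, c_first, c_last = visit(args[0])
        if op in ("star", "plus"):          # repetition loops back
            for p in c_last:
                follow[p] |= c_first
        return (c_null or op != "plus"), c_first, c_last

    def clash(positions):
        names = [labels[p] for p in positions]
        return len(names) != len(set(names))

    _, first, _ = visit(model)
    return not clash(first) and not any(clash(f) for f in follow.values())


print(is_deterministic(("choice", ("seq", "b", "c"), ("seq", "b", "d"))))
print(is_deterministic(("seq", "b", ("choice", "c", "d"))))
```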

Algorithms exist which allow many but not all non-deterministic content models to be reduced automatically to equivalent deterministic models; see Brüggemann-Klein 1991 [Brüggemann-Klein].

F. Autodetection of Character Encodings (Non-Normative)

The XML encoding declaration functions as an internal label on each entity, indicating which character encoding is in use. Before an XML processor can read the internal label, however, it apparently has to know what character encoding is in use—which is what the internal label is trying to indicate. In the general case, this is a hopeless situation. It is not entirely hopeless in XML, however, because XML limits the general case in two ways: each implementation is assumed to support only a finite set of character encodings, and the XML encoding declaration is restricted in position and content in order to make it feasible to autodetect the character encoding in use in each entity in normal cases. Also, in many cases other sources of information are available in addition to the XML data stream itself. Two cases may be distinguished, depending on whether the XML entity is presented to the processor without, or with, any accompanying (external) information. We consider the first case first.

Because each XML entity not in UTF-8 or UTF-16 format must begin with an XML encoding declaration, in which the first characters must be '<?xml', any conforming processor can detect, after two to four octets of input, which of the following cases apply. In reading this list, it may help to know that in UCS-4, '<' is "#x0000003C" and '?' is "#x0000003F", and the Byte Order Mark required of UTF-16 data streams is "#xFEFF".

  • 00 00 00 3C: UCS-4, big-endian machine (1234 order)

  • 3C 00 00 00: UCS-4, little-endian machine (4321 order)

  • 00 00 3C 00: UCS-4, unusual octet order (2143)

  • 00 3C 00 00: UCS-4, unusual octet order (3412)

  • FE FF: UTF-16, big-endian

  • FF FE: UTF-16, little-endian

  • 00 3C 00 3F: UTF-16, big-endian, no Byte Order Mark (and thus, strictly speaking, in error)

  • 3C 00 3F 00: UTF-16, little-endian, no Byte Order Mark (and thus, strictly speaking, in error)

  • 3C 3F 78 6D: UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the ASCII characters, the encoding declaration itself may be read reliably

  • 4C 6F A7 94: EBCDIC (in some flavor; the full encoding declaration must be read to tell which code page is in use)

  • other: UTF-8 without an encoding declaration, or else the data stream is corrupt, fragmentary, or enclosed in a wrapper of some kind
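
The list above maps naturally onto a lookup on the first octets of the entity. The following Python sketch is one way to write it; the function name and the returned descriptions are illustrative assumptions, and the result identifies only the encoding family, not the specific encoding.

```python
def detect_encoding_family(data: bytes) -> str:
    """Guess the encoding family of an XML entity from its first octets."""
    head = data[:4]
    four_octet_patterns = {
        b"\x00\x00\x00\x3c": "UCS-4, big-endian (1234 order)",
        b"\x3c\x00\x00\x00": "UCS-4, little-endian (4321 order)",
        b"\x00\x00\x3c\x00": "UCS-4, unusual octet order (2143)",
        b"\x00\x3c\x00\x00": "UCS-4, unusual octet order (3412)",
        b"\x00\x3c\x00\x3f": "UTF-16, big-endian, no Byte Order Mark",
        b"\x3c\x00\x3f\x00": "UTF-16, little-endian, no Byte Order Mark",
        b"\x3c\x3f\x78\x6d": "ASCII-compatible (read the encoding declaration)",
        b"\x4c\x6f\xa7\x94": "EBCDIC (read the encoding declaration)",
    }
    if head in four_octet_patterns:
        return four_octet_patterns[head]
    # The Byte Order Mark needs only the first two octets.
    if head[:2] == b"\xfe\xff":
        return "UTF-16, big-endian"
    if head[:2] == b"\xff\xfe":
        return "UTF-16, little-endian"
    return "other: UTF-8 without an encoding declaration, or unusable data"


print(detect_encoding_family(b"<?xml version='1.0'?>"))
print(detect_encoding_family("<?xml".encode("utf-16-le")))
```

In the ASCII-compatible and EBCDIC cases the caller would still have to read the encoding declaration to learn which member of the family is in use.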

This level of autodetection is enough to read the XML encoding declaration and parse the character-encoding identifier, which is still necessary to distinguish the individual members of each family of encodings (e.g. to tell UTF-8 from 8859, and the parts of 8859 from each other, or to distinguish the specific EBCDIC code page in use, and so on).

Because the contents of the encoding declaration are restricted to ASCII characters, a processor can reliably read the entire encoding declaration as soon as it has detected which family of encodings is in use. Since in practice, all widely used character encodings fall into one of the categories above, the XML encoding declaration allows reasonably reliable in-band labeling of character encodings, even when external sources of information at the operating-system or transport-protocol level are unreliable.

Once the processor has detected the character encoding in use, it can act appropriately, whether by invoking a separate input routine for each case, or by calling the proper conversion function on each character of input.

Like any self-labeling system, the XML encoding declaration will not work if any software changes the entity's character set or encoding without updating the encoding declaration. Implementers of character-encoding routines should be careful to ensure the accuracy of the internal and external information used to label the entity.

The second possible case occurs when the XML entity is accompanied by encoding information, as in some file systems and some network protocols. When multiple sources of information are available, their relative priority and the preferred method of handling conflict should be specified as part of the higher-level protocol used to deliver XML. Rules for the relative priority of the internal label and the MIME-type label in an external header, for example, should be part of the RFC document defining the text/xml and application/xml MIME types. In the interests of interoperability, however, the following rules are recommended.

  • If an XML entity is in a file, the Byte-Order Mark and encoding-declaration PI are used (if present) to determine the character encoding. All other heuristics and sources of information are solely for error recovery.

  • If an XML entity is delivered with a MIME type of text/xml, then the charset parameter on the MIME type determines the character encoding method; all other heuristics and sources of information are solely for error recovery.

  • If an XML entity is delivered with a MIME type of application/xml, then the Byte-Order Mark and encoding-declaration PI are used (if present) to determine the character encoding. All other heuristics and sources of information are solely for error recovery.
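
Read as a decision procedure, the three recommendations might look like the Python sketch below. The function and parameter names are hypothetical. The sketch assumes that internal_label stands for whatever the Byte-Order Mark and encoding declaration yielded, and that mime_type is None for an entity read from a file. Delivery methods not covered by the rules above are deliberately left undecided.

```python
from typing import Optional


def authoritative_encoding(mime_type: Optional[str],
                           charset: Optional[str],
                           internal_label: Optional[str]) -> Optional[str]:
    """Pick the encoding source that the recommended rules treat as decisive.

    mime_type      -- None for a file, else the delivered MIME type
    charset        -- charset parameter of the MIME type, if any
    internal_label -- result of the Byte-Order Mark / encoding declaration
    """
    if mime_type is None or mime_type == "application/xml":
        return internal_label   # BOM and encoding declaration decide
    if mime_type == "text/xml":
        return charset          # charset parameter decides
    return None                 # not covered by the rules above


print(authoritative_encoding("text/xml", "ISO-8859-1", "UTF-8"))
print(authoritative_encoding("application/xml", "ISO-8859-1", "UTF-8"))
```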

These rules apply only in the absence of protocol-level documentation; in particular, when the MIME types text/xml and application/xml are defined, the recommendations of the relevant RFC will supersede these rules.

G. W3C XML Working Group (Non-Normative)

This specification was prepared and approved for publication by the W3C XML Working Group (WG). WG approval of this specification does not necessarily imply that all WG members voted for its approval. The current and former members of the XML WG are:

Jon Bosak, Sun (Chair); James Clark (Technical Lead); Tim Bray, Textuality and Netscape (XML Co-editor); Jean Paoli, Microsoft (XML Co-editor); C. M. Sperberg-McQueen, U. of Ill. (XML Co-editor); Dan Connolly, W3C (W3C Liaison); Paula Angerstein, Texcel; Steve DeRose, INSO; Dave Hollander, HP; Eliot Kimber, ISOGEN; Eve Maler, ArborText; Tom Magliery, NCSA; Murray Maloney, Muzmo and Grif; Makoto Murata, Fuji Xerox Information Systems; Joel Nava, Adobe; Conleth O'Connell, Vignette; Peter Sharpe, SoftQuad; John Tigue, DataChannel
