The Document Information Item is the root of the information found in an XML document. There is only one such root item.
This information item begins with the ContentHandler.startDocument() call and ends with the ContentHandler.endDocument() call. Many SAX2 event calls are used to construct its children or constituents.
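The bracketing described above can be seen with a small sketch. It assumes the JAXP parser bundled with the JDK; the class name DocumentBoundsDemo and the sample document are hypothetical and exist only for illustration. The handler records each callback so that the first and last entries can be compared against startDocument() and endDocument().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records the order of SAX2 callbacks for one document.
class DocumentBoundsDemo {

    /** Parses the XML text and returns the names of the callbacks seen, in order. */
    static List<String> events(String xml) {
        final List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { seen.add("startDocument"); }
            @Override public void endDocument() { seen.add("endDocument"); }
            @Override public void processingInstruction(String target, String data) {
                seen.add("processingInstruction:" + target);
            }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                seen.add("startElement:" + localName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                seen.add("endElement:" + localName);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so localName is populated
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Wrapped only to keep the sketch short; real code would handle SAXException.
            throw new IllegalStateException("parse failed", e);
        }
        return seen;
    }

    public static void main(String[] args) {
        for (String event : events("<?note hypothetical?><doc/>")) {
            System.out.println(event);
        }
    }
}
```

Everything recorded between the first and last entries corresponds to a child or constituent of the Document Information Item.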
Property | Callbacks | Explanation |
---|---|---|
[children] | | See the sections for each type of Information Item: Document Type Declaration (one, if present), Element (one), processing instruction (possibly many), Comment (possibly many). |
[document element] | | This is the element in the [children] property. |
[notations] | | See the section on Notation Information Items. (Unordered.) |
[unparsed entities] | | See the section on Unparsed Entity Information Items. (Unordered.) |
[base URI] | Locator may be used during the Alternatively, for any parsers that don’t provide a Locator, applications using an XMLReader are responsible for providing this information (if it exists) to the | |
[character encoding scheme] | Normally this property is unavailable; it won’t affect the interpretation of character data in Java. However, applications will in rare cases provide this to the parser when they call | |
[standalone] | It’s likely that an upcoming extension API will provide this information using an is-standalone feature flag. | |
[version] | unavailable | You can probably assume the value of this property is “1.0” for now. It’s likely that an upcoming extension API will provide this information. |
[all declarations processed] | When | |
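The [base URI] and [character encoding scheme] rows can be illustrated with a rough sketch. The table text does not name the exact calls an application would use, so the choice of InputSource.setSystemId() and InputSource.setEncoding() here is an assumption about what fits; both are standard SAX2 methods, as are ContentHandler.setDocumentLocator() and Locator.getSystemId(). The class name BaseUriDemo and the example URI are hypothetical.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: reads the document's base URI through a Locator,
// falling back to what the application itself supplied.
class BaseUriDemo {

    static class BaseUriHandler extends DefaultHandler {
        private Locator locator;   // stays null if the parser provides no Locator
        private String baseUri;

        @Override public void setDocumentLocator(Locator locator) {
            this.locator = locator;
        }

        @Override public void startDocument() {
            // At this point the locator, if any, describes the document entity.
            baseUri = (locator != null) ? locator.getSystemId() : null;
        }

        String baseUri() { return baseUri; }
    }

    /** Returns the base URI seen by the handler, or the supplied one as a fallback. */
    static String parseAndReportBaseUri(byte[] xml, String systemId, String encoding) {
        BaseUriHandler handler = new BaseUriHandler();
        try {
            InputSource source = new InputSource(new ByteArrayInputStream(xml));
            source.setSystemId(systemId);   // application-supplied base URI
            source.setEncoding(encoding);   // application-supplied encoding, rarely needed
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(source);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        String fromLocator = handler.baseUri();
        return (fromLocator != null) ? fromLocator : systemId;
    }
}
```

The fallback on the last line reflects the point made in the table: when no Locator is available, the application is the only party that knows the base URI.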
Because text in Java is always accessed using UTF-16 character strings or arrays, most applications won’t need to worry about encoding issues; the SAX2 parser handles that. However, there are cases when encoding may matter:
Some recent XML standards require that text be normalized. For example, XML Canonicalization (as used in digital signature applications) requires the use of Unicode Normalization Form C; some other W3C specifications have the same requirement. Text originally represented in UTF-8 or UTF-16 might need further normalization to remove some deprecated character codes that can be represented using those encodings.
Such encoding data is required on a per-entity basis, not a per-document basis as implied by the Infoset specification. And for internal entity expansions or defaulted attributes, you’ll need to normalize if the encoding associated with the original definition supported denormalized text.
When using an output encoding that is not based on the Unicode character set, you may not be able to represent XML names that use particular characters. For example, ASCII cannot handle element or attribute names using accented characters (used in Europe and Latin America) or using ideographic characters (used in Asia).
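As a minimal sketch of the normalization case, the standard java.text.Normalizer class can bring text into Unicode Normalization Form C. The class name NfcDemo and the helper toNfc are hypothetical; deciding when to apply normalization (per entity, as noted above) is left to the application.

```java
import java.text.Normalizer;

// Hypothetical demo class: normalizes text to Unicode Normalization Form C (NFC).
class NfcDemo {

    /** Returns the text in NFC, leaving already-normalized input untouched. */
    static String toNfc(String text) {
        if (Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
            return text;
        }
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301";          // 'e' followed by a combining acute accent
        String composed = toNfc(decomposed);    // single precomposed character U+00E9
        System.out.println(decomposed.length() + " -> " + composed.length());
    }
}
```

An application might call such a helper on the strings delivered by characters() or on attribute values before canonicalizing or signing them.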
The preferred encoding solution is to always use UTF-8 or UTF-16 when outputting XML, so that such problems cannot occur and so that all XML processors can work with such output. Similar logic applies to display systems like window systems: prefer font rendering systems that use Unicode over those tied to some specific encoding.
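One way to detect the name problem before writing output is to ask the target charset whether it can encode a given XML name. This is only a sketch: the class name NameEncodingDemo and the helper canWriteName are hypothetical, while Charset.newEncoder() and CharsetEncoder.canEncode() are standard java.nio.charset calls.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: checks whether an XML name survives a given output encoding.
class NameEncodingDemo {

    /** Returns true if every character of the name can be written in the charset. */
    static boolean canWriteName(String xmlName, Charset outputCharset) {
        return outputCharset.newEncoder().canEncode(xmlName);
    }

    public static void main(String[] args) {
        String name = "r\u00e9sum\u00e9";   // element name with accented characters
        System.out.println("US-ASCII: " + canWriteName(name, StandardCharsets.US_ASCII));
        System.out.println("UTF-8:    " + canWriteName(name, StandardCharsets.UTF_8));
    }
}
```

Because UTF-8 and UTF-16 cover the whole Unicode character set, the check is only interesting for other output encodings, which is one more reason to prefer them.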