The primary Infoset concern for SAX2 event consumers is to understand how the stream of events represents the information structures used in the Infoset. Applications need to track some state if they need access to some of those structures or random access to anything. It’s typical to track only a few items, and ignore the rest as being incidental background noise. Streaming processing discards items as soon as possible.
You really shouldn’t care, but since the String datatype can’t handle more than two gigabytes of data, and strings are used to pass certain document data to applications, there’s a chance that some documents could cause trouble by overflowing that limit. If you encounter such a document, consult a pathologist. There really isn’t much you can do about this.
The [children] properties are arbitrarily sized, ordered sequences of information items, which SAX2 event callbacks present in document order. Most other collection-valued properties, such as [notations], [unparsed entities], and [attributes], are unordered. Only [children] properties need to be stored in order-preserving data structures.
While most information items are provided through a single callback, some of the more complex ones involve matched, and (except in one case) cleanly nested, pairs of calls to start() and end() the item. Such items include the Document itself, its Document Type Declaration, Elements, and Namespace Information. To track those items, applications implement some kind of context stack tracking.
The [parent] properties of some information items are implicitly encoded through such SAX2 nested event reports. Except for items that can be direct children of the Document or Document Type Information Items, applications often push stack entries when startElement() is called and pop them when endElement() is called.
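The push/pop pattern just described can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class name ContextStackHandler and the helper pathsOf() are hypothetical names introduced here, and the sketch assumes a default (non-namespace-aware) JAXP parser, so elements are identified by their qName.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: recovers each element's ancestry from nested events.
class ContextStackHandler extends DefaultHandler {
    // Top of the stack is the element currently being processed.
    private final Deque<String> stack = new ArrayDeque<>();
    // One path per element, recorded in document order.
    private final List<String> paths = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        stack.push(qName);          // entering the element: it becomes the context
        paths.add(currentPath());
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        stack.pop();                // leaving the element: the parent is the context again
    }

    // Builds "/root/child/..." by walking the stack from bottom to top.
    private String currentPath() {
        StringBuilder sb = new StringBuilder();
        Iterator<String> it = stack.descendingIterator();
        while (it.hasNext()) {
            sb.append('/').append(it.next());
        }
        return sb.toString();
    }

    List<String> getPaths() {
        return paths;
    }

    // Convenience helper (an assumption of this sketch): parse a string and
    // return the recorded paths, wrapping checked exceptions.
    static List<String> pathsOf(String xml) {
        try {
            ContextStackHandler handler = new ContextStackHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getPaths();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The [parent] of any element is simply the entry below it on the stack, so nothing needs to be stored once endElement() has popped it. A streaming application would normally act on each path as it is seen instead of collecting them in a list.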
The children of Document and Document Type Information Items have curious restrictions: they don’t always match the actual text structure. For example, information items for notations and unparsed entities are found in the Document Information Item, but they’re textually part of the Document Type; and comments are stripped out of DTDs. You can use more natural structures in your applications if the descriptive Infoset structure seems awkward.
Other complex information items are implicitly decoded from DTD declarations. To track such items, applications must save declarations during DTD processing, to ensure that they can be correlated with information in the body of a document. Examples of such items include [notation] properties for Unparsed Entities and processing instructions, most properties for Unexpanded Entity References, and [references] properties of attributes.
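As a rough sketch of saving declarations for later correlation, the handler below records notation and unparsed entity declarations reported through DTDHandler callbacks, then uses them when an attribute of type ENTITY appears in the document body. The class name DeclarationTracker and its methods notationIdFor() and getResolved() are hypothetical; the sketch also assumes that a notation is adequately identified by its public ID, falling back to its system ID.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: saves DTD declarations so they can be correlated
// with attribute values seen later in the document body.
class DeclarationTracker extends DefaultHandler {
    // notation name -> identifier (public ID if present, else system ID)
    private final Map<String, String> notations = new HashMap<>();
    // unparsed entity name -> name of its notation
    private final Map<String, String> entityNotations = new HashMap<>();
    // "attr=entity -> notation identifier", in document order
    private final List<String> resolved = new ArrayList<>();

    @Override
    public void notationDecl(String name, String publicId, String systemId) {
        notations.put(name, publicId != null ? publicId : systemId);
    }

    @Override
    public void unparsedEntityDecl(String name, String publicId,
                                   String systemId, String notationName) {
        entityNotations.put(name, notationName);
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            // Only attributes declared as ENTITY refer to unparsed entities.
            if ("ENTITY".equals(atts.getType(i))) {
                String entity = atts.getValue(i);
                resolved.add(atts.getQName(i) + "=" + entity
                        + " -> " + notationIdFor(entity));
            }
        }
    }

    // Looks up lazily, because a notation may be declared after the
    // entity that names it. Returns null if either declaration is missing.
    String notationIdFor(String entityName) {
        String notationName = entityNotations.get(entityName);
        return notationName == null ? null : notations.get(notationName);
    }

    List<String> getResolved() {
        return resolved;
    }
}
```

The lookup is deferred until the attribute is seen, so the order of declarations within the DTD does not matter. Attributes of type ENTITIES would need their values split on whitespace first; that is left out here for brevity.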
Some information items have a [base URI] property that is computed according to xml:base rules. Except for two cases, these rules amount to using Locator.getSystemId() to find the absolute base URI; the producer needs to provide this information. SAX2 effectively augments every information item with this information, as well as line and column location within such entities. (However, applications can cause this information to be lost if they provide InputSource objects without including those base URIs as the system IDs.) The two exceptional cases are for Elements and for processing instructions within the document element. In these instances, the computation is complex because xml:base attributes can play a role; it is demonstrated in Example 5-1.
Consumers must be able to invoke Locator.getSystemId() to get the entity’s URI in LexicalHandler.startEntity() when the entity is shown to be external using DeclHandler.externalEntityDecl(). And they must also maintain a stack of URIs, augmenting it with xml:base values.
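The stack of URIs can be sketched roughly as below. This is a simplified illustration under stated assumptions: it starts from the document entity's system ID as reported by the Locator, pushes a new base for every element, and resolves any xml:base attribute against the enclosing base. It deliberately ignores external entities, so the LexicalHandler.startEntity() bookkeeping is not shown. The class name BaseUriTracker, the helper basesOf(), and the example.com URIs are hypothetical.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: tracks the base URI in effect for each element.
class BaseUriTracker extends DefaultHandler {
    private final Deque<URI> bases = new ArrayDeque<>();
    // element qName -> base URI in effect for that element
    private final Map<String, String> seen = new LinkedHashMap<>();
    private Locator locator;

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startDocument() {
        // The document entity's system ID is the initial base URI.
        String id = (locator == null) ? null : locator.getSystemId();
        bases.push(id == null ? URI.create("") : URI.create(id));
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        URI base = bases.peek();
        String xmlBase = atts.getValue("xml:base");
        if (xmlBase != null) {
            base = base.resolve(xmlBase);   // may itself be a relative URI
        }
        bases.push(base);                   // one entry per open element
        seen.put(qName, base.toString());
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        bases.pop();
    }

    // Convenience helper (an assumption of this sketch). Setting the system
    // ID on the InputSource is what makes Locator.getSystemId() useful.
    static Map<String, String> basesOf(String systemId, String xml) {
        try {
            InputSource in = new InputSource(new StringReader(xml));
            in.setSystemId(systemId);
            BaseUriTracker handler = new BaseUriTracker();
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
            return handler.seen;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Pushing an entry for every element, even when it has no xml:base attribute, keeps endElement() trivial: it always pops exactly one entry. Relative URIs found in an element's attributes or content would then be resolved against the top of the stack.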
Application code should use Locator information to generate meaningful diagnostics. However, conforming applications will use the URI computed with xml:base when absolutizing relative URIs found in attribute values, character data, processing instructions, or (primarily for HTML legacy data models) comments.
Except for the startDTD() call, all system identifiers reported through SAX are delivered as absolute URIs. An upcoming extension feature flag will probably let that behavior be changed, so you can choose whether the parser or the application absolutizes the URIs. Meanwhile, you should be aware that some SAX parsers have bugs in how they report such identifiers.