To close this chapter, I’ll present some ideas that should ease the challenge of disambiguating schemas.
Among the different forms of ambiguity, name classes have been the easiest to disambiguate. Why is this? Name classes aren’t inherently simpler than regular expressions or datatypes. All these tools define sets of things that can happen in XML documents, and in many ways they are deeply similar. The reason that name classes and datatypes have been easier to disambiguate is that they have a first-class except operator. If you had the same level of support for patterns in general, you could more easily disambiguate them too.
It is possible to apply the except pattern to datatypes and write:
element foo { (xsd:boolean - xsd:integer) | xsd:integer }
A value that is only an integer will obviously match only the right alternative. A value that is exclusively boolean (true or false) matches the left alternative. A value that is both a boolean and an integer (0 or 1) matches the first condition of the left alternative (xsd:boolean) but is rejected by its except clause, so it too matches only the right alternative.
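The three cases above can be checked with a small Python sketch. It is only an illustration: the helper names (`is_boolean`, `is_integer`, `matching_alternatives`) are hypothetical, and the lexical checks approximate the W3C XML Schema datatypes rather than reproduce them exactly.

```python
import re


def is_boolean(value: str) -> bool:
    # Assumption: approximates the lexical space of xsd:boolean.
    return value.strip() in {"true", "false", "0", "1"}


def is_integer(value: str) -> bool:
    # Assumption: approximates the lexical space of xsd:integer.
    return re.fullmatch(r"[+-]?[0-9]+", value.strip()) is not None


def matching_alternatives(value: str) -> list[str]:
    """Return which alternatives of
    (xsd:boolean - xsd:integer) | xsd:integer
    accept the given value."""
    matches = []
    # Left alternative: a boolean that is not also an integer.
    if is_boolean(value) and not is_integer(value):
        matches.append("left")
    # Right alternative: any integer.
    if is_integer(value):
        matches.append("right")
    return matches


for sample in ["true", "1", "42"]:
    print(sample, matching_alternatives(sample))
```

Because no value is ever reported under both alternatives, the choice is unambiguous in this model.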
Unfortunately, this rule can’t be generalized beyond the scope of data patterns. (Note that the examples given next with the except (-) operator aren’t valid RELAX NG.)
If this rule could be generalized, and applied to an ambiguous regular expression such as:
two|(one?,two+,three*)
you could write:
two|((one?,two+,three*)-two)
Of course, this same set of results can be created with the existing RELAX NG patterns, but a generalized except would make that flexibility much more accessible.
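To make the effect of such a hypothetical operator concrete, here is a rough Python sketch that models a sequence of child elements as a string of names and each alternative as a regular expression. The function name `alternatives` and the `use_except` flag are inventions for this illustration; nothing here corresponds to a real RELAX NG feature or API.

```python
import re

# Each child-element sequence is modelled as a string in which every
# element name is followed by a single space, e.g. "one two two ".
TWO = re.compile(r"two ")
SEQ = re.compile(r"(one )?(two )+(three )*")


def alternatives(children: list[str], use_except: bool) -> list[str]:
    """Return the alternatives of two|(one?,two+,three*) that accept
    the sequence. With use_except=True, the second alternative is
    treated as ((one?,two+,three*)-two)."""
    text = "".join(name + " " for name in children)
    hits = []
    if TWO.fullmatch(text):
        hits.append("two")
    excluded = use_except and TWO.fullmatch(text) is not None
    if SEQ.fullmatch(text) and not excluded:
        hits.append("(one?,two+,three*)")
    return hits


print(alternatives(["two"], use_except=False))
print(alternatives(["two"], use_except=True))
```

Without the except, a lone two element is accepted by both alternatives; with it, only the first alternative accepts that sequence, while longer sequences still match the second.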
My second proposal is far less disruptive. The idea is just the realization that these ambiguities remain ambiguous only because nothing has been done to rule them out. There are plenty of examples in other computer languages of ambiguities that have been partially or fully ruled out: XSLT templates, order of evaluation of statements in programming languages, or, as we’ve seen in the section about W3C XML Schema, unions of datatypes.
There is nothing preventing the creation of a specification defining a priority for the alternatives to be used by applications interested in instance annotation at large when they encounter ambiguities.
This specification wouldn’t need to apply to RELAX NG processors interested only in validation and would not compromise their optimizations. It could apply only to RELAX NG processors performing instance annotation. It would also guarantee a consistent and interoperable type of annotation for schemas that are currently considered to be ambiguous.
The rule could be as simple as “use the first alternative in document order” or could also take into account additional factors, such as giving a lesser precedence to included grammars, as XSLT does with stylesheet imports.
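As a minimal sketch of the simplest such rule, the following Python fragment annotates a value with the first alternative, in document order, that accepts it. The `annotate` function and the approximate datatype checks are assumptions made for the example and are not part of any existing specification or processor.

```python
import re
from typing import Callable, Optional

# An alternative is a (label, predicate) pair; list order stands in
# for document order in the schema.
Alternative = tuple[str, Callable[[str], bool]]


def is_boolean(value: str) -> bool:
    # Assumption: approximates the lexical space of xsd:boolean.
    return value.strip() in {"true", "false", "0", "1"}


def is_integer(value: str) -> bool:
    # Assumption: approximates the lexical space of xsd:integer.
    return re.fullmatch(r"[+-]?[0-9]+", value.strip()) is not None


def annotate(value: str, alternatives: list[Alternative]) -> Optional[str]:
    """Return the label of the first alternative accepting the value,
    or None when no alternative accepts it."""
    for label, accepts in alternatives:
        if accepts(value):
            return label
    return None


schema = [("xsd:boolean", is_boolean), ("xsd:integer", is_integer)]
print(annotate("1", schema))
```

Validation is unaffected by such a rule: the value is valid whichever alternative is chosen, and only the annotation depends on the order.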
Jeni Tennison proposed a third approach on the xml-dev mailing list: instead of trying to fight against ambiguity, why not accept it? Why couldn’t we acknowledge that something can have several datatypes (or models) and at the same time have a datatype “A” and “B”? Why couldn’t a value be an integer and a boolean simultaneously?
This idea would have a serious impact on specifications such as XPath 2.0, which assign a single datatype to each simple-type element and attribute, but this approach would be much more compatible with the principle that markup is only the projection of a structure over a document. It often happens that a piece of text can have several meanings. By extension, acknowledging that elements and attributes may belong to multiple datatypes at the same time seems like something obvious, yet clever, to do.
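One way to picture this third approach is an annotation that is a set of datatypes rather than a single one. The Python sketch below is purely illustrative: `datatypes_of` and the `CHECKS` table are hypothetical names, and the checks only approximate the W3C XML Schema lexical spaces.

```python
import re

# Assumption: simplified lexical checks for two datatypes.
CHECKS = {
    "xsd:boolean": lambda v: v in {"true", "false", "0", "1"},
    "xsd:integer": lambda v: re.fullmatch(r"[+-]?[0-9]+", v) is not None,
}


def datatypes_of(value: str) -> frozenset[str]:
    """Return every datatype whose check accepts the value,
    instead of forcing a choice of exactly one."""
    text = value.strip()
    return frozenset(name for name, accepts in CHECKS.items() if accepts(text))


print(sorted(datatypes_of("1")))
print(sorted(datatypes_of("true")))
```

Under this model the value 1 is simply annotated as both a boolean and an integer, and nothing needs to be disambiguated.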