14.1. Concepts and Observations

The Unicode Standard is the basis of the character set for XML. Further, most software in the United States makes use of the Latin character subset. The following sections contain guidelines for regular expressions. They also provide a comparison to regular expressions in the Perl language, because many programmers are familiar with Perl. Finally, there is a brief overview of how regular expressions integrate into base and derived types in an XML schema.

14.1.1. Unicode Regular Expression Guidelines

The Unicode Regular Expression Guidelines is a technical report associated with The Unicode Standard. The Schema Recommendation suggests that an XML validator should implement “Level 1” regular expressions as defined in the Unicode Regular Expression Guidelines. There are a few global differences between the Schema Recommendation and the Unicode Regular Expression Guidelines:

  • The Schema Recommendation specifies that XML hexadecimal character references can be used to specify Unicode characters. This notation, which looks like ‘&#xnnnn;’, is a valid substitute for the ‘\unnnn’ notation, according to the Unicode Regular Expression Guidelines. Here nnnn stands for the hexadecimal code point of a specific Unicode character.

  • The Unicode Regular Expression Guidelines suggests—but does not go as far as recommending—that many regular expression engines support line or paragraph separators; the Schema Recommendation does not provide support for these separators.

Otherwise, the Schema Recommendation is consistent with the Unicode Regular Expression Guidelines except where explicitly noted in the subsequent sections of this chapter.
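
The hexadecimal notation works because the XML parser resolves each character reference before the pattern reaches a regular expression engine. The following is a minimal sketch in Python, assuming the standard library's ‘xml.etree.ElementTree’ parser stands in for a schema processor. The ‘pattern’ element and its value are hypothetical and not taken from a real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a pattern facet whose value is written with
# XML hexadecimal character references instead of literal characters.
FRAGMENT = '<pattern value="&#x3A9;&#x3C0;"/>'

# The parser replaces each reference with the character it names, so
# the attribute value already holds the actual Unicode characters.
pattern_text = ET.fromstring(FRAGMENT).get("value")

# Show the code point of every character in the resolved pattern.
code_points = [f"U+{ord(ch):04X}" for ch in pattern_text]
print(pattern_text == "\u03a9\u03c0")
print(code_points)
```

In this sketch ‘&#x3A9;’ in the XML text and ‘\u03a9’ in the Python string literal name the same character, which mirrors the substitution described in the first bullet above.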

14.1.2. The Latin Character Set

English is written with the Latin alphabet. Similarly, the foundation of many programming languages, as well as much data, comes from the ‘BasicLatin’ character block defined by The Unicode Standard (see Section 14.2.5.6). Note that this particular character block coincides with the ASCII characters.

In general, a discussion regarding character sets is superfluous. Regular expressions, however, require extra care. Many single-character patterns (such as ‘\p{Lu}’, which matches an uppercase letter, and ‘\w’, which matches a word character, that is, any character other than punctuation, separator, and “other” characters such as control characters) match characters from many character sets. The result is that what seems like an innocuous regular expression might match many strings that a program or part of a program is probably not ready to handle. Suppose, for example, that an XML schema contains the regular expression ‘the \w+ word’. Many developers would expect to see, in the XML instance, Latin-based strings that conform to this pattern, such as “the green word” or “the automobile word” (where “green” and “automobile” are valid matches for ‘\w+’). In all probability, the XML instance contains the expected Latin-based strings. However, the regular expression also matches strings that contain, say, Greek or Arabic characters. Such a string might look like “the ΩπΔ word”.
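
The effect described above can be reproduced outside a schema processor. The following is a minimal sketch using Python's ‘re’ module, whose ‘\w’ is only an approximation of the schema ‘\w’ (Python also accepts the underscore, for example). The pattern ‘the \w+ word’ uses ‘\w+’ to match one or more word characters. The Latin-only alternative ‘[A-Za-z]+’ is an assumption chosen for illustration, not a recommendation from the Schema Recommendation.

```python
import re

# Python's \w is Unicode-aware by default for str patterns, much like
# the schema \w, but the two definitions are not identical.
UNICODE_WORD = re.compile(r"the \w+ word")

# A deliberately narrow alternative restricted to unaccented Latin letters.
LATIN_ONLY = re.compile(r"the [A-Za-z]+ word")

samples = ["the green word", "the automobile word", "the ΩπΔ word"]

for text in samples:
    # fullmatch requires the whole string to match, as a schema pattern does.
    broad = UNICODE_WORD.fullmatch(text) is not None
    narrow = LATIN_ONLY.fullmatch(text) is not None
    print(f"{text!r}: \\w+ -> {broad}, [A-Za-z]+ -> {narrow}")
```

All three samples match the ‘\w+’ form, while only the first two match the Latin-only form. A schema author who really expects Latin text may prefer an explicit character class or a block escape over ‘\w’.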

14.1.3. Perl Regular Expressions

Many software developers are familiar with regular expressions from the Perl programming language. The regular expressions for XML schemas are similar to regular expressions in Perl, with two important differences:

  • There is no support in XML schema regular expressions for line separators. Perl provides anchors such as ‘^’, which matches the beginning of a line, and ‘$’, which matches the end of a line.

  • The XML schema regular expressions do not support the notion of a “lookahead” anchor or a “lookbehind” anchor, both of which match a pattern, but do not become part of the result—a concept not particularly relevant to XML schemas.

Finally, for those perhaps familiar with other sophisticated string-processing languages, XML schemas do not support the notion of a “fence”, where the regular expression prohibits back-tracking.
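
The absence of ‘^’ and ‘$’ matters less than it might seem, because a schema pattern is applied to the entire value rather than searched for inside it. The following is a minimal sketch in Python, assuming ‘re.fullmatch’ as a stand-in for a schema processor and ‘re.search’ as a stand-in for an unanchored Perl match. The pattern and the function names are hypothetical.

```python
import re

# Hypothetical pattern facet value: three digits, a hyphen, four digits.
PATTERN = r"\d{3}-\d{4}"

def schema_like_match(pattern: str, value: str) -> bool:
    """Approximate schema behavior: the whole value must match."""
    return re.fullmatch(pattern, value) is not None

def perl_like_search(pattern: str, value: str) -> bool:
    """Unanchored search: a match anywhere in the value is enough."""
    return re.search(pattern, value) is not None

print(schema_like_match(PATTERN, "555-1234"))
print(schema_like_match(PATTERN, "call 555-1234"))
print(perl_like_search(PATTERN, "call 555-1234"))
```

Only the first and third calls report a match. To accept surrounding text in a schema pattern, the pattern itself has to describe that text, for example with a leading ‘.*’.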

14.1.4. XML Schemas

Section 14.3 goes into detail about creating simple types that validate values in XML instances against regular expressions. Two overriding—and opposing—rules are worth noting here:

  • When a simple type specifies multiple regular expression patterns, the value in an XML instance must be a legal value for at least one of the regular expressions.

  • When both the base and derived simple types specify regular expressions, the value in an XML instance must be a legal value for at least one of the regular expressions in the base type and at least one of the regular expressions in the derived type.

Caution

The interaction between two pattern constraining facets specified by the same simple type differs from the interaction between two pattern constraining facets when one is specified by the base type and one by the derived type. See the rules above and Section 14.3 for clarification.


The previous rules also apply to complex types constrained to simple content.
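
The two rules above amount to an OR within one type and an AND across derivation steps. The following is a minimal sketch in Python, assuming ‘re.fullmatch’ approximates schema pattern matching. The function ‘is_valid’ and the sample patterns are hypothetical and do not come from a real schema.

```python
import re
from typing import List

def is_valid(value: str, pattern_groups: List[List[str]]) -> bool:
    """Check a value against the patterns of a derivation chain.

    Each inner list holds the patterns declared by one type. The value
    must match at least one pattern in every list.
    """
    return all(
        any(re.fullmatch(pattern, value) for pattern in group)
        for group in pattern_groups
    )

# Hypothetical base type: two letters followed by digits, OR digits only.
base_patterns = [r"[A-Z]{2}\d+", r"\d+"]
# Hypothetical derived type: additionally, at most four characters.
derived_patterns = [r".{1,4}"]

groups = [base_patterns, derived_patterns]
print(is_valid("AB12", groups))   # matches a base pattern and the derived one
print(is_valid("AB123", groups))  # matches a base pattern, too long for derived
print(is_valid("ab12", groups))   # matches no base pattern
print(is_valid("1234", groups))   # matches the second base pattern and derived
```

The first and last values are accepted, and the middle two are rejected. Adding a pattern to the same type can only widen the set of legal values, whereas adding one in a derived type can only narrow it.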
