Regular Expressions for Patterns
The pattern facet uses the familiar regular expression syntax to restrict the lexical space of data types. This appendix gives an overview of the pattern syntax.
Complex regular expressions can be constructed from simpler ones with the help of operators. Let S and T be arbitrary regular expressions, c and d be normal characters, and C be a character class (listed below).
A normal character c is any character that is not a metacharacter. The metacharacters are ., \, ?, *, +, {, }, (, ), |, [, and ].
Expression | Meaning
c | The string consisting of character c.
C | The string consisting of a character belonging to character class C.
c-d | The string consisting of any single character whose code value is between c and d (inclusive). Written inside square brackets, as in [a-z].
ST | Concatenation. All strings st with s matching S and t matching T.
S|T | Choice. All strings s that match S or T.
^S | Negation. All strings s that do not match S. Written inside square brackets, as in [^a-z].
S-T | Difference. All strings s that match S but not T. Written inside square brackets, as in [a-z-[aeiou]].
S? | Option. All strings s that match S, or the empty string.
S* | Iteration. All strings s matching k repetitions of S (k >= 0, so the empty string is included).
S+ | All strings s matching k repetitions of S (k > 0).
S{n,m} | All strings s matching k repetitions of S (n <= k <= m).
S{n} | All strings s matching exactly n repetitions of S.
S{n,} | All strings s matching at least n repetitions of S.
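The repetition and choice operators in the table can be tried out with a small sketch. It assumes Python's standard `re` module as a stand-in: `re` shares the concatenation, choice, option, and repetition syntax shown here, but it is not an XML Schema regular expression engine. The helper name `matches_whole` is hypothetical. Because a pattern facet has to match the entire lexical value, the sketch uses `re.fullmatch` rather than a substring search.

```python
import re

def matches_whole(pattern: str, value: str) -> bool:
    """Return True if the whole of value matches pattern."""
    return re.fullmatch(pattern, value) is not None

# Concatenation and choice: "ab" followed by either "c" or "d".
assert matches_whole(r"ab(c|d)", "abc")
assert not matches_whole(r"ab(c|d)", "abcd")

# Option, star, and plus.
assert matches_whole(r"colou?r", "color")
assert matches_whole(r"a*", "")        # k = 0 repetitions are allowed
assert not matches_whole(r"a+", "")    # at least one repetition is required

# Bounded repetition: S{n,m}, S{n}, S{n,}.
assert matches_whole(r"[0-9]{2,4}", "123")
assert not matches_whole(r"[0-9]{2}", "123")
assert matches_whole(r"[0-9]{2,}", "12345")
```

Constructs that are specific to XML Schema, such as character class subtraction, are not accepted by `re` and would need a schema-aware processor.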
Escape sequences can be used to represent characters that would otherwise be regarded as metacharacters.
Escape Sequence | Represented Character
\n | The newline character (#xA)
\r | The return character (#xD)
\t | The tab character (#x9)
\\ | \
\| | |
\. | .
\- | -
\^ | ^
\? | ?
\* | *
\+ | +
\{ | {
\} | }
\( | (
\) | )
\[ | [
\] | ]
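As an illustration of how the escape sequences are applied, the following sketch turns arbitrary text into a pattern that matches that text literally. The function `escape_literal` and the two lookup tables are hypothetical names introduced for this example; the set of escaped characters is taken from the table above and is not checked against any particular schema processor.

```python
# Metacharacters that the escape table lists with a backslash form.
_META = set("\\|.-^?*+{}()[]")
# Control characters that have their own escape sequences.
_CONTROL = {"\n": "\\n", "\r": "\\r", "\t": "\\t"}

def escape_literal(text: str) -> str:
    """Escape text so that every character is matched literally in a pattern."""
    out = []
    for ch in text:
        if ch in _CONTROL:
            out.append(_CONTROL[ch])
        elif ch in _META:
            out.append("\\" + ch)
        else:
            out.append(ch)
    return "".join(out)

print(escape_literal("1+1=2 (approx.)"))  # prints 1\+1=2 \(approx\.\)
```

Python's own `re.escape` is deliberately not used here, since it targets Python's regular expression syntax rather than the escape table of this appendix.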
Character classes are represented by the following escape sequence:
\p{X} | A character belonging to the category denoted by X (see the following tables).
Letter categories:
Category | Represented Characters
L | All letters.
Lu | Only uppercase letters.
Ll | Only lowercase letters.
Lt | Title case. The first character of a word may be uppercase depending on language (see Unicode Technical Report #21).
Lm | Modifier. Various characters such as accents modifying the pronunciation of a character.
Lo | Other.
Mark categories:
Category | Represented Characters
M | All marks.
Mn | Nonspacing marks.
Mc | Spacing combining marks.
Me | Enclosing marks.
Number categories:
Category | Represented Characters
N | All numbers. This includes numbers that do not rely on decimal digits, such as Roman numerals, encircled numbers, bracketed numbers, etc.
Nd | Only decimal digits.
Nl | Letter numbers such as Roman numerals.
No | All other numeric symbols.
Punctuation categories:
Category | Represented Characters
P | All punctuation symbols.
Pc | Connector. All connecting symbols, for example, the underscore.
Pd | Dash. Various connecting dash symbols.
Ps | Open. All opening symbols such as opening parentheses or brackets.
Pe | Close. All closing symbols such as closing parentheses or brackets.
Pi | Initial quote (may behave like Ps or Pe depending on usage).
Pf | Final quote (may behave like Ps or Pe depending on usage).
Po | Other punctuation symbols.
Separator categories:
Category | Represented Characters
Z | All separators.
Zs | Separating space character.
Zl | Line separators.
Zp | Paragraph separators.
Symbol categories:
Category | Represented Characters
S | All symbols.
Sm | Mathematical symbols.
Sc | Currency symbols ($, €).
Sk | Modifier. Various characters such as accents modifying the pronunciation of a character, similar to Lm.
So | Other symbols.
Other categories:
Category | Represented Characters
C | All other characters.
Cc | Control. Nonprintable control characters.
Cf | Formatting characters.
Co | Private use for user-defined characters.
Cn | Not assigned (no specific meaning within Unicode).
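To see which category a given character falls into, the category tables can be explored with Python's standard `unicodedata` module. This is only a sketch: `in_category` is a hypothetical helper, and the Unicode database bundled with a given Python version may differ from the one a schema processor uses, so individual characters can be classified differently.

```python
import unicodedata

def in_category(ch: str, category: str) -> bool:
    """True if ch falls in the given one- or two-letter Unicode category."""
    # A one-letter name such as "L" covers every subcategory starting with it.
    return unicodedata.category(ch).startswith(category)

assert in_category("A", "Lu")
assert in_category("a", "L")
assert in_category("7", "Nd")
assert in_category("\u2163", "Nl")   # ROMAN NUMERAL FOUR
assert in_category("_", "Pc")
assert in_category("(", "Ps")
assert in_category("$", "Sc")
assert in_category(" ", "Zs")
assert not in_category("7", "L")
```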
Abbreviations:
Escape Sequence | Equivalent To | Legend
. | [^\n\r] | Anything except newline or carriage return.
\s | [#x20\t\n\r] | XML whitespace characters.
\S | [^\s] | XML non-whitespace characters.
\i | | The set of initial XML name characters (Letter, '_', or ':').
\I | [^\i] | Anything except initial XML name characters.
\c | | The set of XML name characters (NameChar).
\C | [^\c] | Anything except XML name characters.
\d | \p{Nd} | Decimal digits.
\D | [^\d] | Anything except decimal digits.
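The whitespace and digit abbreviations have close but not identical counterparts in other regular expression dialects. The sketch below assumes Python's standard `re` module and spells XML whitespace out as an explicit character class, because Python's own `\s` accepts additional characters such as form feed. The names `XML_SPACE`, `XML_NON_SPACE`, and `is_xml_whitespace` are hypothetical.

```python
import re

# XML whitespace as an explicit class: space, tab, newline, carriage return.
XML_SPACE = r"[ \t\n\r]"
XML_NON_SPACE = r"[^ \t\n\r]"

def is_xml_whitespace(value: str) -> bool:
    """True if value is non-empty and consists only of XML whitespace."""
    return re.fullmatch(XML_SPACE + "+", value) is not None

assert is_xml_whitespace(" \t\r\n")
assert not is_xml_whitespace("\f")                # form feed is not XML whitespace
assert re.fullmatch(r"\s", "\f")                  # but Python's own \s accepts it
assert re.fullmatch(XML_NON_SPACE + "+", "abc")

# For str patterns, Python's \d matches Unicode decimal digits (category Nd).
assert re.fullmatch(r"\d+", "2024")
assert re.fullmatch(r"\d+", "\u0661\u0662\u0663")  # Arabic-Indic digits
```

The name character abbreviations have no direct equivalent in `re` and are left out of this sketch.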