This chapter describes regular expression pattern matching and string processing based on regular expression substitutions. These features provide the most powerful string processing facilities in Tcl. Tcl commands described are: regexp and regsub.
Regular expressions are a formal way to describe string patterns. They provide a powerful and compact way to specify patterns in your data. Even better, there is a very efficient implementation of the regular expression mechanism due to Henry Spencer. If your script does much string processing, it is worth the effort to learn about the regexp command. Your Tcl scripts will be compact and efficient. This chapter uses many examples to show you the features of regular expressions.
Regular expression substitution is a mechanism that lets you rewrite a string based on regular expression matching. The regsub command is another powerful tool, and this chapter includes several examples that do a lot of work in just a few Tcl commands. Stephen Uhler has shown me several ways to transform input data into a Tcl script with regsub and then use subst or eval to process the data. The idea takes a moment to get used to, but it provides a very efficient way to process strings.
Tcl 8.1 added a new regular expression implementation that supports Unicode and advanced regular expressions (ARE). This implementation adds more syntax and escapes that make it easier to write patterns, once you learn the new features! If you know Perl, then you are already familiar with these features. The Tcl advanced regular expressions are almost identical to the Perl 5 regular expressions. The new features include a few very minor incompatibilities with the regular expressions implemented in Tcl 8.0 and earlier versions, but these rarely occur in practice. The new regular expression package supports Unicode, of course, so you can write patterns to match Japanese or Hindi documents!
Regular expressions can seem overly complex at first. They introduce their own syntax and their own rules, and you may be tempted to use simpler commands like string first, string range, or string match to process your strings. However, often a single regular expression command can replace a sequence of several string commands. Not only do you have to write less code, but you often get a performance improvement because the regular expression matcher is implemented in optimized C code, so pattern matching is fast.
The regular expression matcher does more than test for a match. It also tells you what part of your input string matches the pattern. This is useful for picking data out of a large input string. In fact, you can capture several pieces of data in just one match by using subexpressions. The regexp Tcl command makes this easy by assigning the matching data to Tcl variables. If you find yourself using string first and string range to pick out data, remember that regexp can do it in one step instead.
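As a minimal sketch of that comparison, the following script extracts two fields both ways. The input string and the variable names are made up for illustration.

```tcl
# Hypothetical input: a host name and a port separated by a colon.
set input "server.example.com:8080"

# The long way: find the colon, then slice on either side of it.
set colon [string first ":" $input]
set host1 [string range $input 0 [expr {$colon - 1}]]
set port1 [string range $input [expr {$colon + 1}] end]

# One regexp: the two parenthesized subexpressions are assigned to
# host2 and port2, and match receives the entire matching string.
regexp {([^:]+):([0-9]+)} $input match host2 port2

puts "string commands: $host1 $port1"
puts "regexp:          $host2 $port2"
```

Both approaches produce the same host and port, but the regexp version also checks that the port is numeric as part of the same step.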
The regular expression matcher is structured so that patterns are first compiled into a form that is efficient to match. If you use the same pattern frequently, then the expensive compilation phase is done only once, and all your matching uses the efficient form. These details are completely hidden by the Tcl interface. If you use a pattern twice, Tcl will nearly always be able to retrieve the compiled form of the pattern. As you can see, the regular expression matcher is optimized for lots of heavy-duty string processing.
One of the stumbling blocks with regular expressions is that they use some of the same special characters as Tcl. Any pattern that contains brackets, dollar signs, or spaces must be quoted when used in a Tcl command. In many cases you can group the regular expression with curly braces, so Tcl pays no attention to it. However, when using Tcl 8.0 (or earlier) you may need Tcl to do backslash substitutions on part of the pattern, and then you need to worry about quoting the special characters in the regular expression.
Advanced regular expressions eliminate this problem because backslash substitution is now done by the regular expression engine. Previously, to get \n to mean the newline character (or \t for tab) you had to let Tcl do the substitution. With Tcl 8.1, \n and \t inside a regular expression mean newline and tab. In fact, there are now about 20 backslash escapes you can use in patterns. Now more than ever, remember to group your patterns with curly braces to avoid conflicts between Tcl and the regular expression engine.
The patterns in the first sections of this chapter ignore this problem. The sample expressions in Table 11-7 on page 161 are quoted for use within Tcl scripts. Most are quoted simply by putting the whole pattern in braces, but some are shown without braces for comparison.
This section describes the basics of regular expression patterns, which are found in all versions of Tcl. There are occasional references to features added by advanced regular expressions, but they are covered in more detail starting on page 149. There is enough syntax in regular expressions that there are five tables that summarize all the options. These tables appear together starting at page 154.
A regular expression is a sequence of the following items:
A literal character.
A matching character, character set, or character class.
A repetition quantifier.
An alternation clause.
A subpattern grouped with parentheses.
Most characters simply match themselves. The following pattern matches an a followed by a b:
ab
The general wild-card character is the period, ".". It matches any single character. The following pattern matches an a followed by any character:
a.
Remember that matches can occur anywhere within a string; a pattern does not have to match the whole string. You can change that by using anchors, which are described on page 147.
The matching character can be restricted to a set of characters with the [xyz] syntax. Any of the characters between the two brackets is allowed to match. For example, the following matches either Hello or hello:
[Hh]ello
The matching set can be specified as a range over the character set with the [x-y] syntax. The following matches any digit:
[0-9]
There is also the ability to specify the complement of a set. That is, the matching character can be anything except what is in the set. This is achieved with the [^xyz] syntax. Ranges and complements can be combined. The following matches anything except the uppercase and lowercase letters:
[^a-zA-Z]
If you want a ] in your character set, put it immediately after the initial opening bracket. You do not need to do anything special to include [ in your character set. The following matches any square brackets or curly braces:
[][{}]
Most regular expression syntax characters are no longer special inside character sets. This means you do not need to backslash anything inside a bracketed character set except for backslash itself. The following pattern matches several of the syntax characters used in regular expressions:
[][+*?()|\\]
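The character set forms above can be tried directly with regexp, which returns 1 when the pattern matches somewhere in the string. This is only a sketch with made-up test strings; the last pattern assumes Tcl 8.1 or later, where a doubled backslash inside a set stands for one backslash.

```tcl
set r1 [regexp {[Hh]ello} "Say hello"]      ;# a set of two characters
set r2 [regexp {[0-9]} "no digits here"]    ;# a range
set r3 [regexp {[^a-zA-Z]} "abcXYZ"]        ;# complement: needs a non-letter
set r4 [regexp {[^a-zA-Z]} "abc1"]
set r5 [regexp {[][{}]} "list\[0\]"]        ;# a leading close bracket is literal
set r6 [regexp {[][+*?()|\\]} "2+2"]        ;# only backslash is doubled
puts "$r1 $r2 $r3 $r4 $r5 $r6"              ;# prints: 1 0 0 1 1 1
```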
Advanced regular expressions add names and backslash escapes as shorthand for common sets of characters like white space, alpha, alphanumeric, and more. These are described on page 149 and listed in Table 11-3 on page 156.
Repetition is specified with *, for zero or more, +, for one or more, and ?, for zero or one. These quantifiers apply to the previous item, which is either a matching character, a character set, or a subpattern grouped with parentheses. The following matches a string that contains b followed by zero or more a's:
ba*
You can group part of the pattern with parentheses and then apply a quantifier to that part of the pattern. The following matches a string that has one or more sequences of ab:
(ab)+
The pattern that matches anything, even the empty string, is:
.*
These quantifiers have a greedy matching behavior: They match as many characters as possible. Advanced regular expressions add nongreedy matching, which is described on page 151. For example, a pattern to match a single line might look like this:
.*\n
However, as a greedy match, this will match all the lines in the input, ending with the last newline in the input string. The following pattern matches up through the first newline.
[^\n]*\n
We will shorten this pattern even further on page 151 by using nongreedy quantifiers. There are also special newline sensitive modes you can turn on with some options described on page 153.
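A small sketch of the difference, using a made-up three-line input. The braced \n escapes assume Tcl 8.1 or later.

```tcl
set input "line one\nline two\nline three\n"

# Greedy .* also matches newlines, so this runs to the last newline.
regexp {.*\n} $input greedy

# Excluding newline from the repeated set stops at the first newline.
regexp {[^\n]*\n} $input first

puts [string length $greedy]   ;# 29, the whole input
puts [string length $first]    ;# 9, just "line one" and its newline
```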
Alternation lets you test more than one pattern at the same time. The matching engine is designed to be able to test multiple patterns in parallel, so alternation is efficient. Alternation is specified with |, the pipe symbol. Another way to match either Hello or hello is:
hello|Hello
You can also write this pattern as:
(h|H)ello
or as:
[hH]ello
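The three spellings can be checked against the same words. This is an illustrative sketch; the outcome array and the test words are hypothetical names chosen for the example.

```tcl
# Try each equivalent pattern against three words and record the results.
foreach pattern {hello|Hello (h|H)ello [hH]ello} {
    set results {}
    foreach word {hello Hello jello} {
        lappend results [regexp $pattern $word]
    }
    set outcome($pattern) $results
    puts "$pattern -> $results"
}
```

Each pattern reports a match for hello and Hello and no match for jello.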
By default a pattern does not have to match the whole string. There can be unmatched characters before and after the match. You can anchor the match to the beginning of the string by starting the pattern with ^, or to the end of the string by ending the pattern with $. You can force the pattern to match the whole string by using both. All strings that begin with spaces or tabs are matched with:
^[ \t]+
If you have many text lines in your input, you may be tempted to think of ^ as meaning “beginning of line” instead of “beginning of string.” By default, the ^ and $ anchors are relative to the whole input, and embedded newlines are ignored. Advanced regular expressions support options that make the ^ and $ anchors line-oriented. They also add the \A and \Z anchors that always match the beginning and end of the string, respectively.
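The following sketch exercises the anchors on a few made-up strings. The \t and \A escapes inside braces assume Tcl 8.1 or later.

```tcl
set a [regexp {^[ \t]+} "   indented"]    ;# leading white space
set b [regexp {^[ \t]+} "not indented"]   ;# a space exists, but not at the start
set c [regexp {^[0-9]+$} "12345"]         ;# both anchors: whole string must match
set d [regexp {^[0-9]+$} "123 45"]
set e [regexp {^two} "one\ntwo"]          ;# ^ means start of string by default
set f [regexp {\Aone} "one\ntwo"]         ;# \A always means start of string
puts "$a $b $c $d $e $f"                  ;# prints: 1 0 1 0 0 1
```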
Use the backslash character to turn off these special characters:
. * ? + [ ] ( ) ^ $ |
For example, to match the plus character, you will need:
\+
Remember that this quoting is not necessary inside a bracketed expression (i.e., a character set definition). For example, to match either plus or question mark, either of these patterns will work:
(\+|\?)
[+?]
To match a single backslash, you need two. You must do this everywhere, even inside a bracketed expression. Or you can use \B, which was added as part of advanced regular expressions. Both of these match a single backslash:
\\
\B
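A short sketch of these quoting rules, with the patterns grouped in braces so that Tcl passes the backslashes through to the regular expression engine. The sample strings are invented, and \B assumes Tcl 8.1 or later.

```tcl
set a [regexp {\+} "1+1"]           ;# backslash turns off the quantifier
set b [regexp {^(\+|\?)$} "?"]      ;# alternation of two quoted characters
set c [regexp {^[+?]$} "+"]         ;# no quoting needed inside a set

set path {C:\temp}                  ;# braces keep the backslash literal
set d [regexp {\\} $path]           ;# two backslashes match one
set e [regexp {\B} $path]           ;# \B is a synonym for backslash
puts "$a $b $c $d $e"               ;# prints: 1 1 1 1 1
```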
Versions of Tcl before 8.1 ignored unknown backslash sequences in regular expressions. For example, \= was just =, and \w was just w. Even \n was just n, which was probably frustrating to many beginners trying to get a newline into their pattern. Advanced regular expressions add backslash sequences for tab, newline, character classes, and more. This is a convenient improvement, but in rare cases it may change the semantics of a pattern. Usually these cases are where an unneeded backslash suddenly takes on meaning, or causes an error because it is unknown.
If a pattern can match several parts of a string, the matcher takes the match that occurs earliest in the input string. Then, if there is more than one match from that same point because of alternation in the pattern, the matcher takes the longest possible match. The rule of thumb is: first, then longest. This rule gets changed by nongreedy quantifiers that prefer a shorter match.
Watch out for *, which means zero or more, because zero of anything is pretty easy to match. Suppose your pattern is:
[a-z]*
This pattern will match against 123abc, but not how you expect. Instead of matching on the letters in the string, the pattern will match on the zero-length substring at the very beginning of the input string! This behavior can be seen by using the -indices option of the regexp command described on page 158. This option tells you the location of the matching string instead of the value of the matching string.
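The zero-length match is easy to observe. In this sketch the variable names are arbitrary; an empty match is reported as a start index with an end index one less than the start.

```tcl
# Zero or more letters: satisfied immediately by nothing at index 0.
regexp -indices {[a-z]*} 123abc where
puts $where     ;# prints: 0 -1

# One or more letters: forces the match onto the actual letters.
regexp -indices {[a-z]+} 123abc where2
puts $where2    ;# prints: 3 5
```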
Use parentheses to capture a subpattern. The string that matches the pattern within parentheses is remembered in a matching variable, which is a Tcl variable that gets assigned the string that matches the pattern. Using parentheses to capture subpatterns is very useful. Suppose we want to get everything between the <td> and </td> tags in some HTML. You can use this pattern:
<td>([^<]*)</td>
The matching variable gets assigned the part of the input string that matches the pattern inside the parentheses. You can capture many subpatterns in one match, which makes it a very efficient way to pick apart your data. Matching variables are explained in more detail on page 158 in the context of the regexp command.
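As a sketch of how this looks with the regexp command, the script below captures one table cell and then two cells in a single match. The HTML fragment and variable names are invented for the example.

```tcl
set html "<tr><td>Widget</td><td>42</td></tr>"

# One subpattern: match gets the whole match, cell gets the captured text.
regexp {<td>([^<]*)</td>} $html match cell
puts $cell            ;# prints: Widget

# Two subpatterns in one pattern fill two matching variables.
regexp {<td>([^<]*)</td><td>([^<]*)</td>} $html match name count
puts "$name $count"   ;# prints: Widget 42
```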
Sometimes you need to introduce parentheses but you do not care about the match that occurs inside them. The pattern is slightly more efficient if the matcher does not need to remember the match. Advanced regular expressions add noncapturing parentheses with this syntax:
(?:pattern)
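The effect on matching variables can be seen in a small sketch (Tcl 8.1 or later; the input string is made up). With ordinary parentheses the repeated group takes up the first matching variable; with noncapturing parentheses it does not.

```tcl
# Capturing: first receives the last repetition of (ab), second the c's.
regexp {(ab)+(c+)} "ababccc" match first second
puts "$first $second"   ;# prints: ab ccc

# Noncapturing: the group still repeats, but only (c+) is remembered.
regexp {(?:ab)+(c+)} "ababccc" match only
puts $only              ;# prints: ccc
```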
The syntax added by advanced regular expressions is mostly just shorthand notation for constructs you can make with the basic syntax already described. There are also some new features that add additional power: nongreedy quantifiers, back references, look-ahead patterns, and named character classes. If you are just starting out with regular expressions, you can ignore most of this section, except for the one about backslash sequences. Once you master the basics, or if you are already familiar with regular expressions in Tcl (or the UNIX vi editor or grep utility), then you may be interested in the new features of advanced regular expressions.
Advanced regular expressions add syntax in an upward compatible way. Old patterns continue to work with the new matcher, but advanced regular expressions will raise errors if given to old versions of Tcl. For example, the question mark is used in many of the new constructs, and it is artfully placed in locations that would not be legal in older versions of regular expressions. The added syntax is summarized in Table 11-2 on page 155.
If you have unbraced patterns from older code, they are very likely to be correct in Tcl 8.1 and later versions. For example, the following pattern picks out everything up to the next newline. The pattern is unbraced, so Tcl substitutes the newline character for each occurrence of \n. The square brackets are quoted so that Tcl does not think they delimit a nested command:
regexp "(\[^\n\]+)\n" $input
The above command behaves identically when using advanced regular expressions, although you can now also write it like this:
regexp {([^\n]+)\n} $input
The curly braces hide the brackets from the Tcl parser, so they do not need to be escaped with backslash. This saves us two characters and looks a bit cleaner.
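A quick way to convince yourself that the two quoting styles are equivalent is to run both on the same input. This sketch adds matching variables (match1, line1, and so on), which are illustrative names, and assumes Tcl 8.1 or later for the braced form.

```tcl
set input "first line\nsecond line\n"

# Quoted: Tcl turns \n into a newline and the brackets must be escaped.
regexp "(\[^\n\]+)\n" $input match1 line1

# Braced: the regular expression engine interprets \n itself.
regexp {([^\n]+)\n} $input match2 line2

puts $line1   ;# prints: first line
puts $line2   ;# prints: first line
```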
The most significant change in advanced regular expression syntax is backslash substitutions. In Tcl 8.0 and earlier, a backslash was only used to turn off special characters such as: . + * ? [ ]. Otherwise it was ignored. For example, \n was simply n to the Tcl 8.0 regular expression engine. This was a source of confusion, and it meant you could not always quote patterns in braces to hide their special characters from Tcl's parser. In advanced regular expressions, \n now means the newline character to the regular expression engine, so you should never need to let Tcl do backslash processing.
Again, always group your pattern with curly braces to avoid confusion.
Advanced regular expressions add a lot of new backslash sequences. They are listed in Table 11-4 on page 156. Some of the more useful ones include \s, which matches space-like characters, \w, which matches letters, digits, and the underscore, \y, which matches the beginning or end of a word, and \B, which matches a backslash.
Character classes are names for sets of characters. The named character class syntax is valid only inside a bracketed character set. The syntax is:
[:identifier:]
For example, alpha is the name for the set of uppercase and lowercase letters. The following two patterns are almost the same:
[A-Za-z]
[[:alpha:]]
The difference is that the alpha character class also includes accented characters like è. If you match data that contains non-ASCII characters, the named character classes are more general than trying to name the characters explicitly.
There are also backslash sequences that are shorthand for some of the named character classes. The following patterns to match digits are equivalent:
[0-9]
[[:digit:]]
\d
The following patterns match space-like characters including space, form feed, newline, carriage return, tab, and vertical tab:
[ \f\n\r\t\v]
[[:space:]]
\s
The named character classes and the associated backslash sequence are listed in Table 11-3 on page 156.
You can use character classes in combination with other characters or character classes inside a character set definition. The following patterns match letters, digits, and underscore:
[[:digit:][:alpha:]_]
[\d[:alpha:]_]
[[:alnum:]_]
\w
Note that \d, \s, and \w can be used either inside or outside character sets. When used outside a bracketed expression, they form their own character set. There are also \D, \S, and \W, which are the complement of \d, \s, and \w. These escapes (i.e., \D for not-a-digit) cannot be used inside a bracketed character set.
There are two special character classes, [[:<:]] and [[:>:]], that match the beginning and end of a word, respectively. A word is defined as one or more characters that match \w.
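The sketch below tries several of these classes and escapes (Tcl 8.1 or later). The test strings are made up; the accented word is built with a Unicode escape so the script does not depend on the encoding of the source file.

```tcl
set word "caf\u00e9"                          ;# ends in an accented e

set a [regexp {^[A-Za-z]+$} $word]            ;# explicit ASCII ranges miss the accent
set b [regexp {^[[:alpha:]]+$} $word]         ;# the named class includes it
set c [regexp {^\d+$} "2024"]                 ;# \d outside a set
set d [regexp {^[[:digit:]]+$} "2024"]        ;# the equivalent named class
set e [regexp {^[\w.]+$} "config_v2.tcl"]     ;# \w combined with . inside a set
set f [regexp {\ycat\y} "concatenate"]        ;# \y requires a word boundary
set g [regexp {\ycat\y} "the cat sat"]
puts "$a $b $c $d $e $f $g"                   ;# prints: 0 1 1 1 1 0 1
```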
The *, +, and ? characters are quantifiers that specify repetition. By default these match as many characters as possible, which is called greedy matching. A nongreedy match will match as few characters as possible. You can specify nongreedy matching by putting a question mark after these quantifiers. Consider the pattern to match “one or more of not-a-newline followed by a newline.” The not-a-newline must be explicit with the greedy quantifier, as in:
[^\n]+\n
Otherwise, if the pattern were just
.+\n
then the "." could well match newlines, so the pattern would greedily consume everything until the very last newline in the input. A nongreedy match would be satisfied with the very first newline instead:
.+?\n
By using the nongreedy quantifier we've cut the pattern from eight characters to five. Another example that is shorter with a nongreedy quantifier is the HTML example from page 148. The following pattern also matches everything between <td> and </td>:
<td>(.*?)</td>
Even ? can be made nongreedy, ??, which means it prefers to match zero instead of one. This only makes sense inside the context of a larger pattern. Send me email if you have a compelling example for it!
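A sketch of greedy versus nongreedy matching on a made-up row with two cells (Tcl 8.1 or later):

```tcl
set html "<td>one</td><td>two</td>"

# Greedy: .* runs to the last closing tag in the input.
regexp {<td>(.*)</td>} $html match greedy
puts $greedy   ;# prints: one</td><td>two

# Nongreedy: .*? stops at the first closing tag.
regexp {<td>(.*?)</td>} $html match lazy
puts $lazy     ;# prints: one
```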
The {m,n} syntax is a quantifier that means match at least m and at most n of the previous matching item. There are two variations on this syntax. A simple {m} means match exactly m of the previous matching item. A {m,} means match m or more of the previous matching item. All of these can be made nongreedy by adding a ? after them.
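The bound quantifiers can be sketched as follows (Tcl 8.1 or later). The phone-number style input is only an illustration.

```tcl
set a [regexp {^[0-9]{3}-[0-9]{4}$} "555-1234"]   ;# exactly 3, then exactly 4
set b [regexp {^[0-9]{3}-[0-9]{4}$} "55-12345"]
set c [regexp {^[a-z]{2,4}$} "abcde"]             ;# five letters exceed the maximum
puts "$a $b $c"                                   ;# prints: 1 0 0

# Two or more digits: greedy takes them all, nongreedy takes the minimum.
regexp {[0-9]{2,}} "id 12345" most
regexp {[0-9]{2,}?} "id 12345" least
puts "$most $least"                               ;# prints: 12345 12
```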
A back reference is a feature you cannot easily get with basic regular expressions. A back reference matches the value of a subpattern captured with parentheses. If you have several sets of parentheses you can refer back to different captured expressions with \1, \2, and so on. You count by left parentheses to determine the reference.
For example, suppose you want to match a quoted string, where you can use either single or double quotes. You need to use an alternation of two patterns to match strings that are enclosed in double quotes or in single quotes:
("[^"]*"|'[^']*')
With a back reference, \1, the pattern becomes simpler:
('|").*?\1
The first set of parentheses matches the leading quote, and then the \1 refers back to that particular quote character. The nongreedy quantifier ensures that the pattern matches up to the first occurrence of the matching quote.
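Both forms can be compared on a string where an apostrophe appears inside a double-quoted phrase. This is a sketch with an invented sentence; it assumes Tcl 8.1 or later for the back reference.

```tcl
set input {He said "it's fine" and left}

# Alternation of two complete quoted-string patterns.
regexp {("[^"]*"|'[^']*')} $input match1

# Back reference: \1 must repeat whichever quote character was captured.
regexp {('|").*?\1} $input match2 quote

puts $match1   ;# prints: "it's fine"
puts $match2   ;# prints: "it's fine"
```

The apostrophe does not end the match in either case, because the closing quote has to be the same character as the opening one.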
Look-ahead patterns are subexpressions that are matched but do not consume any of the input. They act like constraints on the rest of the pattern, and they typically occur at the end of your pattern. A positive look-ahead lets the overall pattern match only if the look-ahead subexpression also matches at that point. A negative look-ahead lets the overall pattern match only if the subexpression would not match there. These constraints make more sense in the context of matching variables and in regular expression substitutions done with the regsub command. For example, the following pattern matches a filename that begins with A and ends with .txt:
^A.*\.txt$
The next version of the pattern adds parentheses to group the file name suffix.
^A.*(\.txt$)
The parentheses are not strictly necessary, but they are introduced so that we can compare the pattern to one that uses look-ahead. A version of the pattern that uses look-ahead looks like this:
^A.*(?=\.txt$)
The pattern with the look-ahead constraint matches only the part of the filename before the .txt, but only if the .txt is present. In other words, the .txt is not consumed by the match. This is visible in the value of the matching variables used with the regexp command. It would also affect the substitutions done in the regsub command.
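The difference shows up in the matching variables and in regsub, as in this sketch (Tcl 8.1 or later; the file names are made up):

```tcl
# Ordinary group: the suffix is part of the match.
regexp {^A.*(\.txt$)} "Alpha.txt" whole suffix
puts "$whole $suffix"   ;# prints: Alpha.txt .txt

# Look-ahead: the suffix must be there, but it is not consumed.
regexp {^A.*(?=\.txt$)} "Alpha.txt" base
puts $base              ;# prints: Alpha

# Without the suffix the look-ahead fails and there is no match.
set other [regexp {^A.*(?=\.txt$)} "Alpha.doc"]

# regsub replaces only the consumed part, so the suffix survives.
regsub {^A.*(?=\.txt$)} "Alpha.txt" "Beta" renamed
puts $renamed           ;# prints: Beta.txt
```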
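There is negative look-ahead too. The following pattern matches a filename that begins with A and does not end with .txt. The constraint is placed right after the A, before the .* that consumes the rest of the name. If it came after the .*, the .* could consume the whole filename and the look-ahead would then trivially succeed at the end of the string.
^A(?!.*\.txt$).*
Writing this pattern without negative look-ahead is awkward.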
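Here is a sketch of a negative look-ahead that rejects the .txt suffix (Tcl 8.1 or later; the file names are made up). It checks the constraint immediately after the A. The last line shows why the placement matters: with the look-ahead at the very end, a greedy .* can consume the whole name, and the constraint is then satisfied at the end of the string.

```tcl
# The constraint is checked right after the A, then .* takes the rest.
set pattern {^A(?!.*\.txt$).*}

set a [regexp $pattern "Alpha.doc" name]   ;# begins with A, no .txt suffix
set b [regexp $pattern "Alpha.txt"]        ;# rejected by the look-ahead
set c [regexp $pattern "Beta.doc"]         ;# does not begin with A
puts "$a $b $c $name"                      ;# prints: 1 0 0 Alpha.doc

# A trailing look-ahead does not reject the .txt file.
set d [regexp {^A.*(?!\.txt$)} "Alpha.txt"]
puts $d                                    ;# prints: 1
```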
The \nn and \mmm syntax, where n and m are digits, can also mean an 8-bit character code corresponding to the octal value nn or mmm. This has priority over a back reference. However, I just wouldn't use this notation for character codes. Instead, use the Unicode escape sequence, \unnnn, which specifies a 16-bit value. The \xnn sequence also specifies an 8-bit character code. Unfortunately, the \x escape consumes all hex digits after it (not just two!) and then truncates the hexadecimal value down to 8 bits. This misfeature of \x is not considered a bug and will probably not change even in future versions of Tcl.
The \Uyyyyyyyy syntax is reserved for 32-bit Unicode, but I don't expect to see that implemented anytime soon.
Collating elements are characters or long names for characters that you can use inside character sets. Currently, Tcl only has some long names for various ASCII punctuation characters. Potentially, it could support names for every Unicode character, but it doesn't because the mapping tables would be huge. This section will briefly mention the syntax so that you can understand it if you see it. But its usefulness is still limited.
Within a bracketed expression, the following syntax is used to specify a collating element:
[.identifier.]
The identifier can be a character or a long name. The supported long names can be found in the generic/regc_locale.c file in the Tcl source code distribution. A few examples are shown below:
[.c.]
[.#.]
[.number-sign.]
An equivalence class is all characters that sort to the same position. This is another feature that has limited usefulness in the current version of Tcl. In Tcl, characters sort by their Unicode character value, so there are no equivalence classes that contain more than one character! However, you could imagine a character class for 'o', 'ò', and other accented versions of the letter o. The syntax for equivalence classes within bracketed expressions is:
[=char=]
where char is any one of the characters in the character class. This syntax is valid only inside a character class definition.
By default, the newline character is just an ordinary character to the matching engine. You can make the newline character special with two options: lineanchor and linestop. You can set these options with flags to the regexp and regsub Tcl commands, or you can use the embedded options described later in Table 11-5 on page 157.
The lineanchor option makes the ^ and $ anchors work relative to newlines. The ^ matches immediately after a newline, and $ matches immediately before a newline. These anchors continue to match the very beginning and end of the input, too. With or without the lineanchor option, you can use \A and \Z to match the beginning and end of the string.
The linestop option prevents . (i.e., period) and character sets that begin with ^ from matching a newline character. In other words, unless you explicitly include \n in your pattern, it will not match across newlines.
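A sketch of the two options, using the -lineanchor and -linestop flags of the regexp command (Tcl 8.1 or later) on a made-up three-line input:

```tcl
set input "alpha\nbeta\ngamma"

# By default ^ and $ refer to the whole string, so a middle line fails.
set a [regexp {^beta$} $input]
# With -lineanchor they also match next to embedded newlines.
set b [regexp -lineanchor {^beta$} $input]
puts "$a $b"    ;# prints: 0 1

# By default . matches newlines, so .* runs to the end of the input.
regexp {a.*} $input all
# With -linestop the match cannot cross a newline.
regexp -linestop {a.*} $input line
puts [string length $all]   ;# prints: 16
puts $line                  ;# prints: alpha
```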
You can start a pattern with embedded options to turn on or off case sensitivity, newline sensitivity, and expanded syntax, which is explained in the next section. You can also switch from advanced regular expressions to a literal string, or to older forms of regular expressions. The syntax is a leading:
(?chars)
where chars is any number of option characters. The option characters are listed in Table 11-5 on page 157.
Expanded syntax lets you include comments and extra white space in your patterns. This can greatly improve the readability of complex patterns. Expanded syntax is turned on with a regexp command option or an embedded option.
Comments start with a # and run until the end of line. Extra white space and comments can occur anywhere except inside bracketed expressions (i.e., character sets) or within multicharacter syntax elements like (?=. When you are in expanded mode, you can turn off the comment character or include an explicit space by preceding them with a backslash. Example 11-1 shows a pattern to match URLs. The leading (?x) turns on expanded syntax. The whole pattern is grouped in curly braces to hide it from Tcl. This example is considered again in more detail in Example 11-3 on page 159:
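The listing for that example is not reproduced at this point, so what follows is only a sketch of an expanded-syntax URL pattern in the same spirit, not necessarily the pattern from Example 11-1. The variable names and the sample URL are assumptions made for illustration (Tcl 8.1 or later).

```tcl
# A hypothetical URL pattern written with expanded syntax.
set url_re {(?x)          # turn on expanded syntax
    ([^:]+):              # protocol, up to the first colon
    //([^:/]+)            # server name
    (?::([0-9]+))?        # optional port number
    (/.*)?                # optional trailing path
}

set url "http://www.example.com:8080/index.html"
regexp $url_re $url match proto server port path
puts "$proto $server $port $path"
```

The white space and comments are ignored by the matcher, so the pattern behaves the same as the compact one-line form while being much easier to read.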
Table 11-1 summarizes the syntax of regular expressions available in all versions of Tcl:
Table 11-1. Basic regular expression syntax
.      Matches any character.
*      Matches zero or more instances of the previous pattern item.
+      Matches one or more instances of the previous pattern item.
?      Matches zero or one instances of the previous pattern item.
( )    Groups a subpattern. The repetition and alternation operators apply to the preceding subpattern.
|      Alternation.
[ ]    Delimit a set of characters. Ranges are specified as [x-y]. If the first character in the set is ^, then there is a match if the remaining characters in the set are not present.
^      Anchor the pattern to the beginning of the string. Only when first.
$      Anchor the pattern to the end of the string. Only when last.
Advanced regular expressions, which were introduced in Tcl 8.1, add more syntax that is summarized in Table 11-2:
Table 11-2. Additional advanced regular expression syntax
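{m}       Matches m instances of the previous pattern item.
{m}?      Matches m instances of the previous pattern item. Nongreedy.
{m,}      Matches m or more instances of the previous pattern item.
{m,}?     Matches m or more instances of the previous pattern item. Nongreedy.
{m,n}     Matches m through n instances of the previous pattern item.
{m,n}?    Matches m through n instances of the previous pattern item. Nongreedy.
*?        Matches zero or more instances of the previous pattern item. Nongreedy.
+?        Matches one or more instances of the previous pattern item. Nongreedy.
??        Matches zero or one instances of the previous pattern item. Nongreedy.
(?:re)    Groups a subpattern, re, but does not capture the result.
(?=re)    Positive look-ahead. Matches the point where re begins.
(?!re)    Negative look-ahead. Matches the point where re does not begin.
(?abc)    Embedded options, where abc is any number of option letters listed in Table 11-5.
\c        One of many backslash escapes listed in Table 11-4.
[: :]     Delimits a character class within a bracketed expression. See Table 11-3.
[. .]     Delimits a collating element within a bracketed expression.
[= =]     Delimits an equivalence class within a bracketed expression.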
Table 11-3 lists the named character classes defined in advanced regular expressions and their associated backslash sequences, if any. Character class names are valid inside bracketed character sets with the [:class:] syntax.
Table 11-3. Character classes
alnum     Upper and lower case letters and digits.
alpha     Upper and lower case letters.
blank     Space and tab.
cntrl     Control characters.
digit     The digits zero through nine. Also \d.
graph     Printing characters that are not in cntrl or space.
lower     Lowercase letters.
print     The same as …
punct     Punctuation characters.
space     Space, newline, carriage return, tab, vertical tab, form feed. Also \s.
upper     Uppercase letters.
xdigit    Hexadecimal digits: zero through nine, a-f, A-F.
Table 11-4 lists backslash sequences supported in Tcl 8.1.
Table 11-4. Backslash escapes in regular expressions
\a        Alert, or "bell", character.
\A        Matches only at the beginning of the string.
\b        Backspace character, \u0008.
\B        Synonym for backslash.
\cX       Control-X.
\d        Digits. Same as [[:digit:]].
\D        Not a digit. Same as [^[:digit:]].
\e        Escape character, \u001B.
\f        Form feed, \u000C.
\m        Matches the beginning of a word.
\M        Matches the end of a word.
\n        Newline, \u000A.
\r        Carriage return, \u000D.
\s        Space. Same as [[:space:]].
\S        Not a space. Same as [^[:space:]].
\t        Horizontal tab, \u0009.
\uXXXX    A 16-bit Unicode character code.
\v        Vertical tab, \u000B.
\w        Letters, digits, and underscore. Same as [[:alnum:]_].
\W        Not a letter, digit, or underscore. Same as [^[:alnum:]_].
\xhh      An 8-bit hexadecimal character code. Consumes all hex digits after \x.
\y        Matches the beginning or end of a word.
\Y        Matches a point that is not the beginning or end of a word.
\Z        Matches the end of the string.