This chapter describes string manipulation and simple pattern matching. Tcl commands described are: string, append, format, scan, and binary. The string command is a collection of several useful string manipulation operations.
Strings are the basic data item in Tcl, so it should not be surprising that there are a large number of commands to manipulate strings. A closely related topic is pattern matching, in which string comparisons are made more powerful by matching a string against a pattern. This chapter describes a simple pattern matching mechanism that is similar to that used in many other shell languages. Chapter 11 describes a more complex and powerful regular expression pattern matching mechanism.
The string command is really a collection of operations you can perform on strings. The following example calculates the length of the value of a variable.
set name "Brent Welch"
string length $name
=> 11
The first argument to string determines the operation. You can ask string for valid operations by giving it a bad one:
string junk
=> bad option "junk": should be bytelength, compare, equal, first, index, is, last, length, map, match, range, repeat, replace, tolower, totitle, toupper, trim, trimleft, trimright, wordend, or wordstart
This trick of feeding a Tcl command bad arguments to find out its usage is common across many commands. Table 4-1 summarizes the string command.
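Table 4-1. The string command
string bytelength str | Returns the number of bytes used to store a string, which may be different from the character length returned by string length because of UTF-8 encoding.
string compare ?-nocase? ?-length len? str1 str2 | Compares strings lexicographically. Use -nocase for case-insensitive comparison. Use -length to limit the comparison to the first len characters. Returns 0 if equal, -1 if str1 sorts before str2, else 1.
string equal ?-nocase? str1 str2 | Compares strings and returns 1 if they are the same. Use -nocase for case-insensitive comparison.
string first str1 str2 | Returns the index in str2 of the first occurrence of str1, or -1 if str1 is not found.
string index string index | Returns the character at the specified index. An index counts from zero. Use end for the last character.
string is class ?-strict? ?-failindex varname? string | Returns 1 if string belongs to class. If -strict, then empty strings never match, otherwise they always match. If -failindex is specified, then varname is assigned the index of the character in string that prevented it from being a member of class. See Table 4-3 for character class names.
string last str1 str2 | Returns the index in str2 of the last occurrence of str1, or -1 if str1 is not found.
string length string | Returns the number of characters in string.
string map ?-nocase? charMap string | Returns a new string created by mapping characters in string according to the input, output list in charMap.
string match pattern str | Returns 1 if str matches the pattern, else 0. Glob-style matching is used.
string range str i j | Returns the range of characters in str from i to j.
string repeat str count | Returns str repeated count times.
string replace str first last ?newstr? | Returns a new string created by replacing characters first through last with newstr, or nothing.
string tolower string ?first? ?last? | Returns string in lower case. first and last determine the range of string on which to operate.
string totitle string ?first? ?last? | Capitalizes string by replacing its first character with the Unicode title case, or upper case, and the rest with lower case. first and last determine the range of string on which to operate.
string toupper string ?first? ?last? | Returns string in upper case. first and last determine the range of string on which to operate.
string trim string ?chars? | Trims the characters in chars from both ends of string. chars defaults to whitespace.
string trimleft string ?chars? | Trims the characters in chars from the beginning of string. chars defaults to whitespace.
string trimright string ?chars? | Trims the characters in chars from the end of string. chars defaults to whitespace.
string wordend str ix | Returns the index in str of the character after the word containing the character at index ix.
string wordstart str ix | Returns the index in str of the first character in the word containing the character at index ix.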
These are the string operations I use most:
The equal operation, which is shown in Example 4-2 on page 53.
String match. This pattern matching operation is described on page 53.
The tolower, totitle, and toupper operations convert case.
The trim, trimright, and trimleft operations are handy for cleaning up strings.
These new operations were added in Tcl 8.1 (actually, they first appeared in the 8.1.1 patch release):
The equal operation, which is simpler than using string compare.
The is operation, which tests for kinds of strings. String classes are listed in Table 4-3 on page 54.
The map operation, which translates characters (like the Unix tr command).
The repeat and replace operations.
The totitle operation, which is handy for capitalizing words.
Several of the string operations involve string indices that are positions within a string. Tcl counts characters in strings starting with zero. The special index end is used to specify the last character in a string:
string range abcd 2 end
=> cd
Tcl 8.1 added syntax for specifying an index relative to the end. Specify end-N to get the Nth character before the end. For example, the following command returns a new string that drops the first and last characters from the original:
string range $string 1 end-1
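As a small illustration of end-relative indices, here is a sketch that wraps the command above in a proc. The name dropEnds is hypothetical and used only for this example.

```tcl
# Hypothetical helper: strip the first and last characters of a string.
proc dropEnds {str} {
    return [string range $str 1 end-1]
}

dropEnds {"quoted"}           ;# => quoted
string index abcd end-1       ;# => c
string range abcd end-1 end   ;# => cd
```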
There are several operations that pick apart strings: first, last, wordstart, wordend, index, and range. If you find yourself using combinations of these operations to pick apart data, it may be faster if you can do it with the regular expression pattern matcher described in Chapter 11.
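The following sketch shows how these operations combine. The sample path and sentence are made-up values chosen for illustration.

```tcl
# Split a path at its last slash using string last and string range.
set path /usr/local/lib/tcl8.4
set slash [string last / $path]                          ;# => 14
set dir   [string range $path 0 [expr {$slash - 1}]]     ;# => /usr/local/lib
set tail  [string range $path [expr {$slash + 1}] end]   ;# => tcl8.4

# wordstart and wordend bracket the word that contains a given index.
set text "hello brave world"
set start [string wordstart $text 8]                     ;# => 6
set stop  [string wordend $text 8]                       ;# => 11
set word  [string range $text $start [expr {$stop - 1}]] ;# => brave
```

Note that wordend returns the index just past the word, so the range ends at one less than that value.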
Strings can be compared with expr, if, and while using the comparison operators eq, ne, ==, !=, <, and >. However, there are a number of subtle issues that can cause problems. First, you must quote the string value so that the expression parser can identify it as a string type. Then, you must group the expression with curly braces to prevent the double quotes from being stripped off by the main interpreter:
if {$x == "foo"} command
Despite the quotes, the expression operators that work on both numbers and strings first try converting their operands to numbers, and convert them back only if they detect a case of string comparison. The conversion back is always done as a decimal number. This can lead to unexpected conversions between strings that look like hexadecimal or octal numbers. The following boolean expression is true!
if {"0xa" == "10"} { puts stdout ack! }
=> ack!
A safe way to compare strings is to use the string compare and string equal operations. The eq and ne expr operators were introduced in 8.4 to allow more compact strict string comparison. These operations also work faster because the unnecessary conversions are eliminated. Like the C library strcmp function, string compare returns 0 if the strings are equal, minus 1 if the first string is lexicographically less than the second, or 1 if the first string is greater than the second.
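A minimal sketch of acting on the three possible results of string compare follows. The proc name describeOrder is hypothetical.

```tcl
# Hypothetical helper: describe how two strings sort relative to each other.
# The -- keeps switch from treating a result of -1 as an option.
proc describeOrder {s1 s2} {
    switch -- [string compare $s1 $s2] {
        -1 { return "$s1 sorts before $s2" }
        0  { return "$s1 and $s2 are equal" }
        1  { return "$s1 sorts after $s2" }
    }
}

describeOrder apple banana       ;# => apple sorts before banana
string compare -nocase ABC abc   ;# => 0
```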
The string equal command added in Tcl 8.1 makes this simpler: it returns 1 if the strings are the same and 0 otherwise.
The eq operator added in Tcl 8.4 is semantically equivalent, but more compact. It also avoids any internal format conversions. There is also a ne operator to efficiently test for inequality.
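To make the difference concrete, this sketch compares the same two values numerically and as strings. It assumes Tcl 8.4 or later for the eq operator; the variable names are arbitrary.

```tcl
set s1 0xa
set s2 10

set numeric [expr {$s1 == $s2}]     ;# => 1, both operands convert to numbers
set strict  [expr {$s1 eq $s2}]     ;# => 0, compared as strings
set same    [string equal $s1 $s2]  ;# => 0
set nocase  [string equal -nocase Tcl TCL]   ;# => 1
```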
The string match command implements glob-style pattern matching that is modeled after the file name pattern matching done by various UNIX shells. The heritage of the word "glob" is rooted in UNIX, and Tcl preserves this historical oddity in the glob command that does pattern matching on file names. The glob command is described on page 122. Table 4-2 shows the three constructs used in string match patterns:
Table 4-2. Matching characters used with string match
* | Match any number of any characters.
? | Match any single character.
[chars] | Match any one of chars.
Any other characters in a pattern are taken as literals that must match the input exactly. The following example matches all strings that begin with a:
string match a* alpha
=> 1
To match all two-letter strings:
string match ?? XY
=> 1
To match all strings that begin with either a or b:
string match {[ab]*} cello
=> 0
Be careful! Square brackets are also special to the Tcl interpreter, so you will need to wrap the pattern up in curly braces to prevent it from being interpreted as a nested command. Another approach is to put the pattern into a variable:
set pat {[ab]*x}
string match $pat box
=> 1
You can specify a range of characters with the syntax [x-y]. For example, [a-z] represents the set of all lower-case letters, and [0-9] represents all the digits. You can include more than one range in a set. Any letter, digit, or the underscore is matched with:
string match {[a-zA-Z0-9_]} $char
The set matches only a single character. To match more complicated patterns, like one or more characters from a set, then you need to use regular expression matching, which is described on page 158.
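If a regular expression is not wanted, one workaround is to test a string one character at a time. The sketch below does that; the proc name isIdentifier is hypothetical.

```tcl
# Hypothetical helper: return 1 if every character in str is a letter,
# a digit, or an underscore. The set is applied to one character at a time.
proc isIdentifier {str} {
    if {[string length $str] == 0} {
        return 0
    }
    foreach char [split $str ""] {
        if {![string match {[a-zA-Z0-9_]} $char]} {
            return 0
        }
    }
    return 1
}

isIdentifier foo_bar1   ;# => 1
isIdentifier foo-bar    ;# => 0
```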
If you need to include a literal *, ?, or bracket in your pattern, preface it with a backslash:
string match {*\?} what?
=> 1
In this case the pattern is quoted with curly braces because the Tcl interpreter is also doing backslash substitutions. Without the braces, you would have to use two backslashes. They are replaced with a single backslash by Tcl before string match is called.
string match *\\? what?
The string is command tests a string to see whether it belongs to a particular class. This is useful for input validation. For example, to make sure something is a number, you do:
if {![string is integer -strict $input]} { error "Invalid input. Please enter a number." }
Classes are defined in terms of the Unicode character set, which means they are more general than specifying character sets with ranges over the ASCII encoding. For example, alpha includes many characters outside the range of [A-Za-z] because of different characters in other alphabets. The classes are listed in Table 4-3.
Table 4-3. Character class names
alnum | Any alphabet or digit character.
alpha | Any alphabet character.
ascii | Any character with a 7-bit character code (i.e., less than 128).
boolean | A valid Tcl boolean value, such as 0, 1, true, or false.
control | Character code less than 32, and not NULL.
digit | Any digit character.
double | A valid floating point number.
false | A valid Tcl boolean false value, such as 0 or false.
graph | Any printing characters, not including space characters.
integer | A valid integer.
lower | A string in all lower case.
print | Any printing characters, including space characters.
punct | Any punctuation character.
space | Space, tab, newline, carriage return, vertical tab, form feed.
true | A valid Tcl boolean true value, such as 1 or true.
upper | A string all in upper case.
wordchar | Alphabet, digit, and the underscore.
xdigit | Valid hexadecimal digits.
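The -strict and -failindex options are easiest to see in a short sketch. The proc name checkDigits and its messages are hypothetical.

```tcl
# Hypothetical validator: report the first character that is not a digit.
proc checkDigits {input} {
    if {[string length $input] == 0} {
        return "empty input"
    }
    if {[string is digit -failindex bad $input]} {
        return "ok"
    }
    return "bad character '[string index $input $bad]' at index $bad"
}

checkDigits 2024               ;# => ok
checkDigits 20x4               ;# => bad character 'x' at index 2
string is integer ""           ;# => 1, empty strings match by default
string is integer -strict ""   ;# => 0
```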
The string map command translates a string based on a character map. The map is in the form of an input, output list. Wherever a string contains an input sequence, that is replaced with the corresponding output. For example:
string map {f p d l} food
=> pool
The inputs and outputs can be more than one character and they do not have to be the same length:
string map {f p d ll oo u} food
=> pull
Example 4-4 is more practical. It uses string map to replace fancy quotes and hyphens produced by Microsoft Word with ASCII equivalents. It uses the open, read, and close file operations that are described in Chapter 9, and the fconfigure command described on page 234 to ensure that the file format is UNIX friendly.
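The core of such a conversion can be sketched as a single string map call. This is not the text of Example 4-4: the proc name asciiPunct is hypothetical, the file handling is left out, and the choice of Unicode code points for the quotes and dashes is an assumption about what the input contains.

```tcl
# Hypothetical helper: replace typographic quotes and dashes with ASCII.
# Assumed inputs: U+2018/U+2019 single quotes, U+201C/U+201D double
# quotes, U+2013 en dash, and U+2014 em dash.
proc asciiPunct {text} {
    set map [list \
        \u2018 '  \u2019 ' \
        \u201c \" \u201d \" \
        \u2013 -  \u2014 --]
    return [string map $map $text]
}

asciiPunct "\u201cHello\u201d \u2014 it\u2019s fine"
# => "Hello" -- it's fine
```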
The append command takes a variable name as its first argument and concatenates its remaining arguments onto the current value of the named variable. The variable is created if it does not already exist:
set foo z
append foo a b c
set foo
=> zabc
The append command provides an efficient way to add items to the end of a string. It modifies a variable directly, so it can exploit the memory allocation scheme used internally by Tcl. Using the append command like this:
append x " some new stuff"
is always faster than this:
set x "$x some new stuff"
The lappend command described on page 65 has similar performance benefits when working with Tcl lists.
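A typical use of append is building up a result inside a loop. The following sketch numbers a list of lines; the proc name numberLines is hypothetical.

```tcl
# Hypothetical helper: build one string from a list of lines,
# appending to the result in place on each iteration.
proc numberLines {lines} {
    set result ""
    set n 0
    foreach line $lines {
        incr n
        append result $n ": " $line \n
    }
    return $result
}

numberLines {alpha beta}
# => 1: alpha
#    2: beta
```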
The format command is similar to the C printf function. It formats a string according to a format specification:
format spec value1 value2 ...
The spec argument includes literals and keywords. The literals are placed in the result as is, while each keyword indicates how to format the corresponding argument. The keywords are introduced with a percent sign, %, followed by zero or more modifiers, and terminate with a conversion specifier. The most general keyword specification for each argument contains up to six parts:
position specifier
flags
field width
precision
word length
conversion character
Example keywords include %f for floating point, %d for integer, and %s for string format. Use %% to obtain a single percent character. The following examples use double quotes around the format specification. This is because often the format contains white space, so grouping is required, as well as backslash substitutions like \t or \n, and the quotes allow substitution of these special characters. Table 4-4 lists the conversion characters:
Table 4-4. Format conversions
d | Signed integer.
u | Unsigned integer.
i | Signed integer. The argument may be in hex (0x) or octal (0) format.
o | Unsigned octal.
x or X | Unsigned hexadecimal. 'x' gives lowercase results.
c | Map from an integer to the ASCII character it represents.
s | A string.
f | Floating point number in the format a.b.
e or E | Floating point number in scientific notation, a.bE+-c.
g or G | Floating point number in either %f or %e format, whichever is shorter.
A position specifier is i$, which means take the value from argument i as opposed to the normally corresponding argument. The position counts from 1. If a position is specified for one format keyword, the position must be used for all of them. If you group the format specification with double quotes, you need to quote the $ with a backslash:
set lang 2
format "%${lang}\$s" one un uno
=> un
The position specifier is useful for picking a string from a set, such as this simple language-specific example. The message catalog facility described in Chapter 15 is a much more sophisticated way to solve this problem. The position is also useful if the same value is repeated in the formatted string.
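Both uses of the position specifier can be sketched briefly. Grouping the specification with braces avoids the need to backslash the $; the sample words are arbitrary.

```tcl
# Reuse the first argument twice in the result.
set cheer [format {%1$s, %1$s, %2$s!} hip hooray]   ;# => hip, hip, hooray!

# Pick one string from a set by position.
set lang 3
set word [format "%${lang}\$s" one un uno]          ;# => uno
```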
The flags in a format are used to specify padding and justification. In the following examples, the # causes a leading 0x to be printed in the hexadecimal value. The zero in 08 causes the field to be padded with zeros. Table 4-5 summarizes the format flag characters.
format "%#x" 20
=> 0x14
format "%#08x" 10
=> 0x00000a
Table 4-5. Format flags
- | Left justify the field.
+ | Always include a sign, either + or -.
space | Precede a number with a space, unless the number has a leading sign. Useful for packing numbers close together.
0 | Pad with zeros.
# | Leading 0 for octal. Leading 0x for hex. Always include a decimal point in floating point. Do not remove trailing zeros (%g).
After the flags you can specify a minimum field width value. The value is padded to this width with spaces, or with zeros if the 0 flag is used:
format "%-20s %3d" Label 2
=> Label                  2
You can compute a field width and pass it to format as one of the arguments by using * as the field width specifier. In this case the next argument is used as the field width instead of the value, and the argument after that is the value that gets formatted.
set maxl 8
format "%-*s = %s" $maxl Key Value
=> Key      = Value
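A computed width is most useful when the width depends on the data. The sketch below lines up a set of key and value pairs on the longest key; the proc name formatPairs is hypothetical.

```tcl
# Hypothetical helper: align "key = value" lines on the longest key.
proc formatPairs {pairs} {
    # First pass: find the widest key.
    set maxl 0
    foreach {key value} $pairs {
        set len [string length $key]
        if {$len > $maxl} {
            set maxl $len
        }
    }
    # Second pass: format each pair using the computed width.
    set out {}
    foreach {key value} $pairs {
        lappend out [format "%-*s = %s" $maxl $key $value]
    }
    return [join $out \n]
}

formatPairs {name Brent language Tcl}
# => name     = Brent
#    language = Tcl
```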
The precision comes next, and it is specified with a period and a number. For %f and %e it indicates how many digits come after the decimal point. For %g it indicates the total number of significant digits used. For %d and %x it indicates how many digits will be printed, padding with zeros if necessary.
format "%6.2f %6.2d" 1 1
=>   1.00     01
The storage length part comes last, but it only became useful in Tcl 8.4, where wide integer support was added. Otherwise Tcl maintains all floating point values in double-precision, and all integers as long words. Wide integers are a minimum of 64 bits wide. By adding the l (long) word length specifier, we can see the difference between regular and wide integers.
format %u -1
=> 4294967295
format %lu -1
=> 18446744073709551615
The scan command parses a string according to a format specification and assigns values to variables. It returns the number of successful conversions it made, unless no capture variables are given, in which case it returns the scan matches in a list. The general form of the command is:
scan string format ?var? ?var? ?var? ...
The format for scan is nearly the same as in the format command. The %c scan format converts one character to its decimal value.
The scan format includes a set notation. Use square brackets to delimit a set of characters. The set matches one or more characters that are copied into the variable. A dash is used to specify a range. The following scans a field of all lowercase letters.
scan abcABC {%[a-z]} result
=> 1
set result
=> abc
If the first character in the set is a right square bracket, then it is considered part of the set. If the first character in the set is ^, then characters not in the set match. Again, put a right square bracket immediately after the ^ to include it in the set. Nothing special is required to include a left square bracket in the set. As in the previous example, you will want to protect the format with braces, or use backslashes, because square brackets are special to the Tcl parser.
Tcl 8.0 added support for binary strings. Previous versions of Tcl used null-terminated strings internally, which foils the manipulation of some types of data. Tcl now uses counted strings, so it can tolerate a null byte in a string value without truncating it.
This section describes the binary command that provides conversions between strings and packed binary data representations. The binary format command takes values and packs them according to a template. For example, this can be used to format a floating point vector in memory suitable for passing to Fortran. The resulting binary value is returned:
binary format template value ?value ...?
The binary scan command extracts values from a binary string according to a similar template. For example, this is useful for extracting data stored in a binary data file. It assigns values to a set of Tcl variables:
binary scan value template variable ?variable ...?
The format template consists of type keys and counts. The count is interpreted differently depending on the type. For types like integer (i) and double (d), the count is a repetition count (e.g., i3 means three integers). For strings, the count is a length (e.g., a3 means a three-character string). If no count is specified, it defaults to 1. If count is *, then binary scan uses all the remaining bytes in the value.
Several type keys can be specified in a template. Each key-count combination moves an imaginary cursor through the binary data. There are special type keys to move the cursor. The x key generates null bytes in binary format, and it skips over bytes in binary scan. The @ key uses its count as an absolute byte offset to which to set the cursor. As a special case, @* skips to the end of the data. The X key backs up count bytes. The types are summarized in Table 4-6. In the table, count is the optional count following the type letter.
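Table 4-6. Binary conversion types
a | A character string of length count. Padded with nulls in binary format.
A | A character string of length count. Padded with spaces in binary format. Trailing nulls and blanks are discarded in binary scan.
b | A binary string of length count. Low-to-high order.
B | A binary string of length count. High-to-low order.
h | A hexadecimal string of length count. Low-to-high order.
H | A hexadecimal string of length count. High-to-low order.
c | An 8-bit character code. The count is for repetition.
s | A 16-bit integer in little-endian byte order. The count is for repetition.
S | A 16-bit integer in big-endian byte order. The count is for repetition.
i | A 32-bit integer in little-endian byte order. The count is for repetition.
I | A 32-bit integer in big-endian byte order. The count is for repetition.
f | Single-precision floating point value in native format. The count is for repetition.
d | Double-precision floating point value in native format. The count is for repetition.
w | A 64-bit integer in little-endian byte order. The count is for repetition.
W | A 64-bit integer in big-endian byte order. The count is for repetition.
x | Pack count null bytes with binary format. Skip count bytes with binary scan.
X | Backup count bytes.
@ | Skip to absolute position specified by count. If count is *, skip to the end.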
Numeric types have a particular byte order that determines how their value is laid out in memory. The type keys are lowercase for little-endian byte order (e.g., Intel) and uppercase for big-endian byte order (e.g., SPARC and Motorola). Different integer sizes are 16-bit (s or S), 32-bit (i or I), and, with Tcl 8.4 or greater, 64-bit (w or W). Note that the official byte order for data transmitted over a network is big-endian. Floating point values are always machine-specific, so it only makes sense to format and scan these values on the same machine.
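Byte order is easiest to see by packing the same number both ways and dumping the bytes as hexadecimal. This is a sketch; the record layout at the end (a 32-bit length followed by four characters) is a made-up example, not a format defined in this chapter.

```tcl
# Pack 258 (0x0102) as a 16-bit integer in each byte order.
set big    [binary format S 258]
set little [binary format s 258]
binary scan $big    H* bigHex        ;# bigHex    => 0102
binary scan $little H* littleHex     ;# littleHex => 0201

# Hypothetical record: big-endian 32-bit length, then 4 characters.
set record [binary format Ia4 4 data]
set size [string length $record]     ;# => 8
binary scan $record Ia4 len payload  ;# len => 4, payload => data
```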
There are three string types: character (a or A), binary (b or B), and hexadecimal (h or H). With these types the count is the length of the string. The a type pads its value to the specified length with null bytes in binary format and the A type pads its value with spaces. If the value is too long, it is truncated. In binary scan, the A type strips trailing blanks and nulls.
A binary string consists of zeros and ones. The b type specifies bits from low-to-high order, and the B type specifies bits from high-to-low order. A hexadecimal string specifies 4 bits (i.e., nybbles) with each character. The h type specifies nybbles from low-to-high order, and the H type specifies nybbles from high-to-low order. The B and H formats match the way you normally write out numbers.
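Scanning a single known byte in all four formats shows the difference in ordering. The sketch uses the letter A, whose character code is 0x41; the variable names are arbitrary.

```tcl
binary scan A B8 highToLow   ;# highToLow => 01000001
binary scan A b8 lowToHigh   ;# lowToHigh => 10000010
binary scan A H2 hexHigh     ;# hexHigh   => 41
binary scan A h2 hexLow      ;# hexLow    => 14

# Going the other way, B8 packs the bits back into the same byte.
set char [binary format B8 01000001]   ;# => A
```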
When you experiment with binary format and binary scan, remember that Tcl treats things as strings by default. A "6", for example, is the character 6 with character code 54 or 0x36. The c type returns these character codes:
set input 6
binary scan $input "c" 6val
set 6val
=> 54
You can scan several character codes at a time:
binary scan abc "c3" list
=> 1
set list
=> 97 98 99
The previous example uses a single type key, so binary scan sets one corresponding Tcl variable. If you want each character code in a separate variable, use separate type keys:
binary scan abc "ccc" x y z
=> 3
set z
=> 99
Use the H format to get hexadecimal values:
binary scan 6 "H2" 6val
set 6val
=> 36
Use the a and A formats to extract fixed width fields. Here the * count is used to get all the rest of the string. Note that A trims trailing spaces:
binary scan "hello world " a3x2A* first second
puts "\"$first\" \"$second\""
=> "hel" " world"
Use the @ key to seek to a particular offset in a value. The following command gets the second double-precision number from a vector. Assume the vector is read from a binary data file:
binary scan $vector "@8d" double
With binary format, the a and A types create fixed width fields. A pads its field with spaces, if necessary. The value is truncated if the string is too long:
binary format "A9A3" hello world
=> hello    wor
An array of floating point values can be created with this command:
binary format "f*" 1.2 3.45 7.43 -45.67 1.03e4
Remember that floating point values are always in native format, so you have to read them on the same type of machine on which they were created. With integer data you specify either big-endian or little-endian formats. The tcl_platform variable described on page 193 can tell you the byte order of the current platform.
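One way to use that information is to choose the integer type key that matches the current machine. The sketch below assumes the byteOrder element of tcl_platform holds either littleEndian or bigEndian; the proc name nativeInt32Key is hypothetical.

```tcl
# Hypothetical helper: return the 32-bit integer type key that matches
# the byte order of the machine running the script.
proc nativeInt32Key {} {
    global tcl_platform
    if {$tcl_platform(byteOrder) eq "littleEndian"} {
        return i
    }
    return I
}

set key [nativeInt32Key]
set bytes [binary format $key 1]
binary scan $bytes H* hex
# hex is 01000000 on a little-endian machine, 00000001 on a big-endian one
binary scan $bytes $key value   ;# value => 1
```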
When working with binary data in files, you need to turn off the newline translations and character set encoding that Tcl performs automatically. These are described in more detail on pages 120 and 219. For example, if you are generating binary data, the following command puts your standard output in binary mode:
fconfigure stdout -translation binary -encoding binary
puts [binary format "B8" 11001010]
To learn more about manipulating data in Tcl, read about lists in Chapter 5 and arrays in Chapter 8.
For more about pattern matching, read about regular expressions in Chapter 11.
For more about file I/O, see Chapter 9.
For information on Unicode and other Internationalization issues, see Chapter 15.