Chapter 4. String Processing in Tcl

This chapter describes string manipulation and simple pattern matching. Tcl commands described are: string, append, format, scan, and binary. The string command is a collection of several useful string manipulation operations.

Strings are the basic data item in Tcl, so it should not be surprising that there are a large number of commands to manipulate strings. A closely related topic is pattern matching, in which string comparisons are made more powerful by matching a string against a pattern. This chapter describes a simple pattern matching mechanism that is similar to that used in many other shell languages. Chapter 11 describes a more complex and powerful regular expression pattern matching mechanism.

The `string` Command

The string command is really a collection of operations you can perform on strings. The following example calculates the length of the value of a variable.

set name "Brent Welch"
string length $name
=> 11

The first argument to string determines the operation. You can ask string for valid operations by giving it a bad one:

string junk
=> bad option "junk": should be bytelength, compare, equal, first, index, is, last, length, map, match, range, repeat, replace, tolower, totitle, toupper, trim, trimleft, trimright, wordend, or wordstart

This trick of feeding a Tcl command bad arguments to find out its usage is common across many commands. Table 4-1 summarizes the string command.

Table 4-1. The string command

`string bytelength str`	Returns the number of bytes used to store a string, which may be different from the character length returned by `string length` because of UTF-8 encoding. See page 220 of Chapter 15 about Unicode and UTF-8.
`string compare ?-nocase? ?-length len? str1 str2`	Compares strings lexicographically. Use `-nocase` for case insensitive comparison. Use `-length` to limit the comparison to the first `len` characters. Returns 0 if equal, -1 if `str1` sorts before `str2`, else 1.
`string equal ?-nocase? str1 str2`	Compares strings and returns 1 if they are the same. Use `-nocase` for case insensitive comparison.
`string first subString string ?startIndex?`	Returns the index in `string` of the first occurrence of `subString`, or -1 if `subString` is not found. `startIndex` may be specified to start in the middle of `string`.
`string index string index`	Returns the character at the specified `index`. An index counts from zero. Use `end` for the last character.
`string is class ?-strict? ?-failindex varname? string`	Returns 1 if `string` belongs to `class`. If `-strict`, then empty strings never match, otherwise they always match. If `-failindex` is specified, then `varname` is assigned the index of the character in `string` that prevented it from being a member of `class`. See Table 4-3 on page 54 for character class names.
`string last subString string ?startIndex?`	Returns the index in `string` of the last occurrence of `subString`, or -1 if `subString` is not found. `startIndex` may be specified to start in the middle of `string`.
`string length string`	Returns the number of characters in `string`.
`string map ?-nocase? charMap string`	Returns a new string created by mapping characters in `string` according to the input, output list in `charMap`. See page 55.
`string match ?-nocase? pattern str`	Returns 1 if `str` matches the `pattern`, else 0. Glob-style matching is used. See page 53.
`string range str i j`	Returns the range of characters in `str` from `i` to `j`.
`string repeat str count`	Returns `str` repeated `count` times.
`string replace str first last ?newstr?`	Returns a new string created by replacing characters `first` through `last` with `newstr`, or nothing.
`string tolower string ?first? ?last?`	Returns `string` in lower case. `first` and `last` determine the range of `string` on which to operate.
`string totitle string ?first? ?last?`	Capitalizes `string` by replacing its first character with the Unicode title case, or upper case, and the rest with lower case. `first` and `last` determine the range of `string` on which to operate.
`string toupper string ?first? ?last?`	Returns `string` in upper case. `first` and `last` determine the range of `string` on which to operate.
`string trim string ?chars?`	Trims the characters in `chars` from both ends of `string`. `chars` defaults to whitespace.
`string trimleft string ?chars?`	Trims the characters in `chars` from the beginning of `string`. `chars` defaults to whitespace.
`string trimright string ?chars?`	Trims the characters in `chars` from the end of `string`. `chars` defaults to whitespace.
`string wordend str ix`	Returns the index in `str` of the character after the word containing the character at index `ix`.
`string wordstart str ix`	Returns the index in `str` of the first character in the word containing the character at index `ix`.

These are the string operations I use most:

The equal operation, which is shown in Example 4-2 on page 53.
String match. This pattern matching operation is described on page 53.
The tolower, totitle, and toupper operations convert case.
The trim, trimright, and trimleft operations are handy for cleaning up strings.

These new operations were added in Tcl 8.1 (actually, they first appeared in the 8.1.1 patch release):

The equal operation, which is simpler than using string compare.
The is operation that tests for kinds of strings. String classes are listed in Table 4-3 on page 54.
The map operation that translates characters (similar to the Unix tr command).
The repeat and replace operations.
The totitle operation, which is handy for capitalizing words.
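
As a quick illustration of the operations listed above, here is a minimal sketch that applies several of them to one sample string. The variable names and sample values are assumptions made for the example; the expected results are shown in the trailing comments.

```tcl
# A few of the operations above applied to one sample string.
set raw "  hello WORLD \n"
set clean [string trim $raw]               ;# strip whitespace at both ends
puts [string toupper $clean]               ;# HELLO WORLD
puts [string totitle $clean]               ;# Hello world
puts [string replace $clean 0 4 goodbye]   ;# goodbye WORLD
puts [string repeat - 5]                   ;# -----
puts [string trimleft 000123 0]            ;# 123
puts [string equal -nocase $clean "HELLO world"]   ;# 1
```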

String Indices

Several of the string operations involve string indices that are positions within a string. Tcl counts characters in strings starting with zero. The special index end is used to specify the last character in a string:

string range abcd 2 end
=> cd

Tcl 8.1 added syntax for specifying an index relative to the end. Specify end-N to get the Nth character before the end. For example, the following command returns a new string that drops the first and last characters from the original:

string range $string 1 end-1

There are several operations that pick apart strings: first, last, wordstart, wordend, index, and range. If you find yourself using combinations of these operations to pick apart data, it may be faster if you can do it with the regular expression pattern matcher described in Chapter 11.
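
The index-based operations can be combined to split a string at a delimiter. The following is a small sketch, not a recommended parser; the sample address and the variable names are made up for illustration.

```tcl
# Split "user@host" at the first @ using only string indices.
set addr "welch@example.com"            ;# sample value (assumption)
set at [string first @ $addr]           ;# index of the @, or -1 if absent
if {$at >= 0} {
    set user [string range $addr 0 [expr {$at - 1}]]
    set host [string range $addr [expr {$at + 1}] end]
    puts "$user at $host"               ;# welch at example.com
}
# The last character, and the one before it:
puts [string index $addr end]           ;# m
puts [string index $addr end-1]         ;# o
```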

Strings and Expressions

Strings can be compared with expr, if, and while using the comparison operators eq, ne, ==, !=, < and >. However, there are a number of subtle issues that can cause problems. First, you must quote the string value so that the expression parser can identify it as a string type. Then, you must group the expression with curly braces to prevent the double quotes from being stripped off by the main interpreter:

if {$x == "foo"} command

Note

expr is only reliable for string comparison when using eq or ne.

Despite the quotes, the expression operators that work on both numbers and strings first try to convert their operands to numbers, and convert them back only if they detect a case of string comparison. The conversion back is always done as a decimal number. This can lead to unexpected conversions between strings that look like hexadecimal or octal numbers. The following boolean expression is true!

if {"0xa" == "10"} { puts stdout ack! }
=> ack!

A safe way to compare strings is to use the string compare and string equal operations. The eq and ne expr operators were introduced in 8.4 to allow more compact strict string comparison. These operations also work faster because the unnecessary conversions are eliminated. Like the C library strcmp function, string compare returns 0 if the strings are equal, minus 1 if the first string is lexicographically less than the second, or 1 if the first string is greater than the second:

Example 4-1. Comparing strings with string compare

if {[string compare $s1 $s2] == 0} {
   # strings are equal
}

The string equal command added in Tcl 8.1 makes this simpler:

Example 4-2. Comparing strings with string equal

if {[string equal $s1 $s2]} {
   # strings are equal
}

The eq operator added in Tcl 8.4 is semantically equivalent, but more compact. It also avoids any internal format conversions. There is also a ne operator to efficiently test for inequality.

Example 4-3. Comparing strings with eq

if {$s1 eq $s2} {
   # strings are equal
}
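
To see the difference between the numeric and the string comparisons side by side, the following sketch compares the same two values three ways. It assumes Tcl 8.4 or later for the eq operator; the variable names are arbitrary.

```tcl
# == converts both operands to numbers when it can; eq never does.
set a "0xa"
set b "10"
puts [expr {$a == $b}]        ;# 1, because both are the number ten
puts [expr {$a eq $b}]        ;# 0, the strings differ
puts [string equal $a $b]     ;# 0, the strings differ
```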

String Matching

The string match command implements glob-style pattern matching that is modeled after the file name pattern matching done by various UNIX shells. The heritage of the word "glob" is rooted in UNIX, and Tcl preserves this historical oddity in the glob command that does pattern matching on file names. The glob command is described on page 122. Table 4-2 shows the three constructs used in string match patterns:

Table 4-2. Matching characters used with string match

`*`	Match any number of any characters.
`?`	Match exactly one character.
`[``chars``]`	Match any character in `chars`.

Any other characters in a pattern are taken as literals that must match the input exactly. The following example matches all strings that begin with a:

string match a* alpha
=> 1

To match all two-letter strings:

string match ?? XY
=> 1

To match all strings that begin with either a or b:

string match {[ab]*} cello
=> 0

Be careful! Square brackets are also special to the Tcl interpreter, so you will need to wrap the pattern up in curly braces to prevent it from being interpreted as a nested command. Another approach is to put the pattern into a variable:

set pat {[ab]*x}
string match $pat box
=> 1

You can specify a range of characters with the syntax [x-y]. For example, [a-z] represents the set of all lower-case letters, and [0-9] represents all the digits. You can include more than one range in a set. Any letter, digit, or the underscore is matched with:

string match {[a-zA-Z0-9_]} $char

The set matches only a single character. To match more complicated patterns, like one or more characters from a set, you need to use regular expression matching, which is described on page 158.

If you need to include a literal *, ?, or bracket in your pattern, preface it with a backslash:

string match {*\?} what?
=> 1

In this case the pattern is quoted with curly braces because the Tcl interpreter is also doing backslash substitutions. Without the braces, you would have to use two backslashes. They are replaced with a single backslash by Tcl before string match is called.

string match *\\? what?
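
A common use of string match is filtering a list of names. The sketch below wraps that loop in a helper; MatchAll is a hypothetical procedure written for this example, not a built-in command, and the file names are invented.

```tcl
# Collect the names that match a glob-style pattern.
# MatchAll is a hypothetical helper, not part of Tcl.
proc MatchAll {pattern names} {
    set result {}
    foreach name $names {
        if {[string match $pattern $name]} {
            lappend result $name
        }
    }
    return $result
}
set files {main.c util.c README notes.txt a1.h B2.h}
puts [MatchAll {*.c} $files]              ;# main.c util.c
puts [MatchAll {[a-z][0-9].h} $files]     ;# a1.h
puts [MatchAll {*\?} {what? why}]         ;# what?
puts [string match -nocase {readme*} README]   ;# 1
```

Every pattern is grouped with braces so that neither the square brackets nor the backslash are touched by the Tcl parser before string match sees them.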

Character Classes

The string is command tests a string to see whether it belongs to a particular class. This is useful for input validation. For example, to make sure something is a number, you do:

if {![string is integer -strict $input]} {
    error "Invalid input. Please enter a number."
}

Classes are defined in terms of the Unicode character set, which means they are more general than specifying character sets with ranges over the ASCII encoding. For example, alpha includes many characters outside the range of [A-Za-z] because of different characters in other alphabets. The classes are listed in Table 4-3.

Table 4-3. Character class names

`alnum`	Any alphabet or digit character.
`alpha`	Any alphabet character.
`ascii`	Any character with a 7-bit character code (i.e., less than 128.)
`boolean`	A valid Tcl boolean value, such as `0`, `1`, `true`, `false` (in any case).
`control`	Character code less than 32, and not NULL.
`digit`	Any digit character.
`double`	A valid floating point number.
`false`	A valid Tcl boolean false value, such as `0` or `false` (in any case).
`graph`	Any printing characters, not including space characters.
`integer`	A valid integer.
`lower`	A string in all lower case.
`print`	Any printing character, including space characters.
`punct`	Any punctuation character.
`space`	Space, tab, newline, carriage return, vertical tab, form feed.
`true`	A valid Tcl boolean true value, such as `1` or `true` (in any case).
`upper`	A string all in upper case.
`wordchar`	Alphabet, digit, and the underscore.
`xdigit`	Valid hexadecimal digits.
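
The -strict and -failindex options are easiest to understand from a short example. The following sketch reports where validation fails; CheckDigits is a hypothetical helper name chosen for this illustration.

```tcl
# Report where validation fails, using -failindex.
# CheckDigits is a hypothetical helper, not part of Tcl.
proc CheckDigits {value} {
    if {[string is digit -strict -failindex ix $value]} {
        return "ok"
    }
    return "bad character at index $ix"
}
puts [CheckDigits 12345]              ;# ok
puts [CheckDigits 12a45]              ;# bad character at index 2

# Without -strict the empty string belongs to every class.
puts [string is integer ""]           ;# 1
puts [string is integer -strict ""]   ;# 0
```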

Mapping Strings

The string map command translates a string based on a character map. The map is in the form of an input, output list. Wherever a string contains an input sequence, that is replaced with the corresponding output. For example:

string map {f p d l} food
=> pool

The inputs and outputs can be more than one character and they do not have to be the same length:

string map {f p d ll oo u} food
=> pull

Example 4-4 is more practical. It uses string map to replace fancy quotes and hyphens produced by Microsoft Word with ASCII equivalents. It uses the open, read, and close file operations that are described in Chapter 9, and the fconfigure command described on page 234 to ensure that the file format is UNIX friendly.

Example 4-4. Mapping Microsoft Word special characters to ASCII

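proc Dos2Unix {filename} {
   set input [open $filename]
   # Open the output file for writing; the default mode is read-only.
   set output [open $filename.new w]
   fconfigure $output -translation lf
   # Map the fancy quotes and dash (octal character codes) to ASCII.
   # Each double quote is escaped so it parses as one list element.
   puts $output [string map {
      \223   \"
      \224   \"
      \222   '
      \226   -
   } [read $input]]
   close $input
   close $output
}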
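
Because the inputs and outputs of string map may have different lengths, the command is also handy for simple escaping. The following is a minimal sketch; HtmlEscape is a hypothetical helper name, and the map covers only four characters.

```tcl
# Escape the characters that are special in HTML.
# HtmlEscape is a hypothetical helper, not part of Tcl.
proc HtmlEscape {text} {
    return [string map {& &amp; < &lt; > &gt; \" &quot;} $text]
}
puts [HtmlEscape {if (a < b && c > d) print "yes"}]
# Expected: if (a &lt; b &amp;&amp; c &gt; d) print &quot;yes&quot;
```

The replacement text is never rescanned, so the & inside an inserted &lt; is not escaped a second time.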

The `append` Command

The append command takes a variable name as its first argument and concatenates its remaining arguments onto the current value of the named variable. The variable is created if it does not already exist:

set foo z
append foo a b c
set foo
=> zabc

Note

The append command is efficient with large strings.

The append command provides an efficient way to add items to the end of a string. It modifies a variable directly, so it can exploit the memory allocation scheme used internally by Tcl. Using the append command like this:

append x " some new stuff"

is always faster than this:

set x "$x some new stuff"

The lappend command described on page 65 has similar performance benefits when working with Tcl lists.
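
A typical use of append is building up a result inside a loop. The sketch below joins a few words with commas; the variable names and the word list are arbitrary.

```tcl
# Build a comma-separated string one piece at a time.
set csv ""
foreach item {alpha beta gamma} {
    if {[string length $csv] > 0} {
        append csv ","
    }
    append csv $item
}
puts $csv          ;# alpha,beta,gamma
```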

The `format` Command

The format command is similar to the C printf function. It formats a string according to a format specification:

format spec value1 value2 ...

The spec argument includes literals and keywords. The literals are placed in the result as is, while each keyword indicates how to format the corresponding argument. The keywords are introduced with a percent sign, %, followed by zero or more modifiers, and terminate with a conversion specifier. The most general keyword specification for each argument contains up to six parts:

position specifier
flags
field width
precision
word length
conversion character

Example keywords include %f for floating point, %d for integer, and %s for string format. Use %% to obtain a single percent character. The following examples use double quotes around the format specification. This is because often the format contains white space, so grouping is required, as well as backslash substitutions like \t or \n, and the quotes allow substitution of these special characters. Table 4-4 lists the conversion characters:

Table 4-4. Format conversions

`d`	Signed integer.
`u`	Unsigned integer.
`i`	Signed integer. The argument may be in hex (0x) or octal (0) format.
`o`	Unsigned octal.
`x or X`	Unsigned hexadecimal. '`x`' gives lowercase results.
`c`	Map from an integer to the ASCII character it represents.
`s`	A string.
`f`	Floating point number in the format `a.b.`
`e or E`	Floating point number in scientific notation, `a.bE+-c.`
`g or G`	Floating point number in either `%f` or `%e` format, whichever is shorter.

A position specifier is i$, which means take the value from argument i as opposed to the normally corresponding argument. The position counts from 1. If a position is specified for one format keyword, the position must be used for all of them. If you group the format specification with double quotes, you need to quote the $ with a backslash:

set lang 2
format "%${lang}\$s" one un uno
=> un

The position specifier is useful for picking a string from a set, such as this simple language-specific example. The message catalog facility described in Chapter 15 is a much more sophisticated way to solve this problem. The position is also useful if the same value is repeated in the formatted string.

The flags in a format are used to specify padding and justification. In the following examples, the # causes a leading 0x to be printed in the hexadecimal value. The zero in 08 causes the field to be padded with zeros. Table 4-5 summarizes the format flag characters.

format "%#x" 20
=> 0x14
format "%#08x" 10
=> 0x0000000a

Table 4-5. Format flags

`-`	Left justify the field.
`+`	Always include a sign, either + or -.
`space`	Precede a number with a space, unless the number has a leading sign. Useful for packing numbers close together.
`0`	Pad with zeros.
`#`	Leading 0 for octal. Leading 0x for hex. Always include a decimal point in floating point. Do not remove trailing zeros (%g).

After the flags you can specify a minimum field width value. The value is padded to this width with spaces, or with zeros if the 0 flag is used:

format "%-20s %3d" Label 2
=> Label                  2

You can compute a field width and pass it to format as one of the arguments by using * as the field width specifier. In this case the next argument is used as the field width instead of the value, and the argument after that is the value that gets formatted.

set maxl 8
format "%-*s = %s" $maxl Key Value
=> Key      = Value

The precision comes next, and it is specified with a period and a number. For %f and %e it indicates how many digits come after the decimal point. For %g it indicates the total number of significant digits used. For %d and %x it indicates how many digits will be printed, padding with zeros if necessary.

format "%6.2f %6.2d" 1 1
=>   1.00     01
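
Flags, field width, and precision are often used together to line up columns. The following sketch formats a small table; the item names, quantities, and prices are invented for illustration.

```tcl
# Format a small aligned table: left-justified name, right-justified
# quantity, and a price with two digits after the decimal point.
set rows {}
foreach {name qty price} {Widget 3 4.5 Gadget 12 19.999} {
    lappend rows [format "%-8s|%4d|%8.2f" $name $qty $price]
}
puts [join $rows \n]
# Expected output:
# Widget  |   3|    4.50
# Gadget  |  12|   20.00
```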

The word length part comes last, but it only became useful in Tcl 8.4, where wide integer support was added. Otherwise Tcl maintains all floating point values in double precision, and all integers as long words. Wide integers are a minimum of 64 bits wide. By adding the l (long) word length specifier, we can see the difference between regular and wide integers.

format %u -1
=> 4294967295
format %lu -1
=> 18446744073709551615

The `scan` Command

The scan command parses a string according to a format specification and assigns values to variables. It returns the number of successful conversions it made, unless no capture variables are given, in which case it returns the scan matches in a list. The general form of the command is:

scan string format ?var? ?var? ?var? ...

The format for scan is nearly the same as in the format command. The %c scan format converts one character to its decimal value.

The scan format includes a set notation. Use square brackets to delimit a set of characters. The set matches one or more characters that are copied into the variable. A dash is used to specify a range. The following scans a field of all lowercase letters.

scan abcABC {%[a-z]} result
=> 1
set result
=> abc

If the first character in the set is a right square bracket, then it is considered part of the set. If the first character in the set is ^, then characters not in the set match. Again, put a right square bracket immediately after the ^ to include it in the set. Nothing special is required to include a left square bracket in the set. As in the previous example, you will want to protect the format with braces, or use backslashes, because square brackets are special to the Tcl parser.
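
The following sketch shows a few common scan idioms: fixed separators, a negated set, the %c conversion, and the list result that is returned when no variables are given. The input strings and variable names are assumptions made for the example.

```tcl
# Parse "hh:mm:ss" into three integers; scan returns the count.
set n [scan "12:34:56" "%d:%d:%d" h m s]
puts "$n fields: $h $m $s"            ;# 3 fields: 12 34 56

# Read everything up to the first colon with a negated set.
scan "PATH:/usr/bin" {%[^:]:%s} key value
puts "$key -> $value"                 ;# PATH -> /usr/bin

# %c yields the character code.
scan A %c code
puts $code                            ;# 65

# With no variables, the converted values come back as a list.
puts [scan "3 4" "%d %d"]             ;# 3 4
```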

The `binary` Command

Tcl 8.0 added support for binary strings. Previous versions of Tcl used null-terminated strings internally, which foils the manipulation of some types of data. Tcl now uses counted strings, so it can tolerate a null byte in a string value without truncating it.

This section describes the binary command that provides conversions between strings and packed binary data representations. The binary format command takes values and packs them according to a template. For example, this can be used to format a floating point vector in memory suitable for passing to Fortran. The resulting binary value is returned:

binary format template value ?value ...?

The binary scan command extracts values from a binary string according to a similar template. For example, this is useful for extracting data stored in a binary data file. It assigns values to a set of Tcl variables:

binary scan value template variable ?variable ...?

Format Templates

The format template consists of type keys and counts. The count is interpreted differently depending on the type. For types like integer (i) and double (d), the count is a repetition count (e.g., i3 means three integers). For strings, the count is a length (e.g., a3 means a three-character string). If no count is specified, it defaults to 1. If count is *, then binary scan uses all the remaining bytes in the value.

Several type keys can be specified in a template. Each key-count combination moves an imaginary cursor through the binary data. There are special type keys to move the cursor. The x key generates null bytes in binary format, and it skips over bytes in binary scan. The @ key uses its count as an absolute byte offset to which to set the cursor. As a special case, @* skips to the end of the data. The X key backs up count bytes. The types are summarized in Table 4-6. In the table, count is the optional count following the type letter.

Table 4-6. Binary conversion types

`a`	A character string of length `count`. Padded with nulls in `binary format`.
`A`	A character string of length `count`. Padded with spaces in `binary format`. Trailing nulls and blanks are discarded in `binary scan`.
`b`	A binary string of length `count`. Low-to-high order.
`B`	A binary string of length `count`. High-to-low order.
`h`	A hexadecimal string of length `count`. Low-to-high order.
`H`	A hexadecimal string of length `count`. High-to-low order. (More commonly used than `h`.)
`c`	An 8-bit character code. The `count` is for repetition.
`s`	A 16-bit integer in little-endian byte order. The `count` is for repetition.
`S`	A 16-bit integer in big-endian byte order. The `count` is for repetition.
`i`	A 32-bit integer in little-endian byte order. The `count` is for repetition.
`I`	A 32-bit integer in big-endian byte order. The `count` is for repetition.
`f`	Single-precision floating point value in native format. The `count` is for repetition.
`d`	Double-precision floating point value in native format. The `count` is for repetition.
`w`	A 64-bit integer in little-endian byte order. The `count` is for repetition. (Tcl 8.4)
`W`	A 64-bit integer in big-endian byte order. The `count` is for repetition. (Tcl 8.4)
`x`	Pack `count` null bytes with `binary format`. Skip `count` bytes with `binary scan`.
`X`	Back up `count` bytes.
`@`	Skip to absolute position specified by `count`. If `count` is `*`, skip to the end.

Numeric types have a particular byte order that determines how their value is laid out in memory. The type keys are lowercase for little-endian byte order (e.g., Intel) and uppercase for big-endian byte order (e.g., SPARC and Motorola). Different integer sizes are 16-bit (s or S), 32-bit (i or I), and, with Tcl 8.4 or greater, 64-bit (w or W). Note that the official byte order for data transmitted over a network is big-endian. Floating point values are always machine-specific, so it only makes sense to format and scan these values on the same machine.

There are three string types: character (a or A), binary (b or B), and hexadecimal (h or H). With these types the count is the length of the string. The a type pads its value to the specified length with null bytes in binary format and the A type pads its value with spaces. If the value is too long, it is truncated. In binary scan, the A type strips trailing blanks and nulls.

A binary string consists of zeros and ones. The b type specifies bits from low-to-high order, and the B type specifies bits from high-to-low order. A hexadecimal string specifies 4 bits (i.e., nybbles) with each character. The h type specifies nybbles from low-to-high order, and the H type specifies nybbles from high-to-low order. The B and H formats match the way you normally write out numbers.

Examples

When you experiment with binary format and binary scan, remember that Tcl treats things as strings by default. A "6", for example, is the character 6 with character code 54 or 0x36. The c type returns these character codes:

set input 6
binary scan $input "c" 6val
set 6val
=> 54

You can scan several character codes at a time:

binary scan abc "c3" list
=> 1
set list
=> 97 98 99

The previous example uses a single type key, so binary scan sets one corresponding Tcl variable. If you want each character code in a separate variable, use separate type keys:

binary scan abc "ccc" x y z
=> 3
set z
=> 99

Use the H format to get hexadecimal values:

binary scan 6 "H2" 6val
set 6val
=> 36

Use the a and A formats to extract fixed width fields. Here the * count is used to get all the rest of the string. Note that A trims trailing spaces:

binary scan "hello world " a3x2A* first second
puts "\"$first\" \"$second\""
=> "hel" " world"

Use the @ key to seek to a particular offset in a value. The following command gets the second double-precision number from a vector. Assume the vector is read from a binary data file:

binary scan $vector "@8d" double

With binary format, the a and A types create fixed width fields. A pads its field with spaces, if necessary. The value is truncated if the string is too long:

binary format "A9A3" hello world
=> hello    wor

An array of floating point values can be created with this command:

binary format "f*" {1.2 3.45 7.43 -45.67 1.03e4}

Remember that floating point values are always in native format, so you have to read them on the same type of machine on which they were created. With integer data you specify either big-endian or little-endian formats. The tcl_platform variable described on page 193 can tell you the byte order of the current platform.
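
To make the byte order concrete, the following sketch packs two integers in big-endian order, dumps the bytes as hexadecimal, and unpacks them again. The values and variable names are chosen only for illustration. The last lines show that binary scan returns signed integers, and one common way to recover an unsigned byte by masking.

```tcl
# Pack a 16-bit and a 32-bit integer in big-endian order.
set packed [binary format "SI" 258 65536]
puts [string length $packed]       ;# 6 bytes in total
binary scan $packed "H*" hex
puts $hex                          ;# 010200010000
binary scan $packed "SI" short long
puts "$short $long"                ;# 258 65536

# The same 16-bit value in little-endian order.
binary scan [binary format "s" 258] "H*" little
puts $little                       ;# 0201

# binary scan yields signed values; mask to get the unsigned byte.
binary scan [binary format "c" 200] "c" byte
puts $byte                         ;# -56
puts [expr {$byte & 0xff}]         ;# 200
```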

Binary Data and File I/O

When working with binary data in files, you need to turn off the newline translations and character set encoding that Tcl performs automatically. These are described in more detail on pages 120 and 219. For example, if you are generating binary data, the following command puts your standard output in binary mode:

fconfigure stdout -translation binary -encoding binary
puts [binary format "B8" 11001010]
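
The same settings apply when reading and writing binary files. The sketch below writes three 32-bit integers to a file and reads them back; the file name binary-demo.dat is an arbitrary choice for this example, and the file is deleted at the end. It relies on -translation binary alone, which also turns off character set encoding on the channel.

```tcl
# Write three 32-bit little-endian integers to a file.
set path binary-demo.dat            ;# arbitrary example file name
set out [open $path w]
fconfigure $out -translation binary
puts -nonewline $out [binary format "i3" {1 2 3}]
close $out

# Read the bytes back and unpack them.
set in [open $path r]
fconfigure $in -translation binary
set data [read $in]
close $in
binary scan $data "i*" values
puts [string length $data]          ;# 12
puts $values                        ;# 1 2 3
file delete $path
```

Note the -nonewline flag on puts, which keeps a trailing newline byte out of the binary data.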

Related Chapters

To learn more about manipulating data in Tcl, read about lists in Chapter 5 and arrays in Chapter 8.
For more about pattern matching, read about regular expressions in Chapter 11.
For more about file I/O, see Chapter 9.
For information on Unicode and other Internationalization issues, see Chapter 15.
