Chapter 4. String Processing in Tcl

This chapter describes string manipulation and simple pattern matching. Tcl commands described are: string, append, format, scan, and binary. The string command is a collection of several useful string manipulation operations.

Strings are the basic data item in Tcl, so it should not be surprising that there are a large number of commands to manipulate strings. A closely related topic is pattern matching, in which string comparisons are made more powerful by matching a string against a pattern. This chapter describes a simple pattern matching mechanism that is similar to that used in many other shell languages. Chapter 11 describes a more complex and powerful regular expression pattern matching mechanism.

The string Command

The string command is really a collection of operations you can perform on strings. The following example calculates the length of the value of a variable.

set name "Brent Welch"
string length $name
=> 11

The first argument to string determines the operation. You can ask string for valid operations by giving it a bad one:

string junk
=> bad option "junk": should be bytelength, compare, equal, first, index, is, last, length, map, match, range, repeat, replace, tolower, totitle, toupper, trim, trimleft, trimright, wordend, or wordstart

This trick of feeding a Tcl command bad arguments to find out its usage is common across many commands. Table 4-1 summarizes the string command.

Table 4-1. The string command

string bytelength str

Returns the number of bytes used to store a string, which may be different from the character length returned by string length because of UTF-8 encoding. See page 220 of Chapter 15 about Unicode and UTF-8.

string compare ?-nocase? ?-length len? str1 str2

Compares strings lexicographically. Use -nocase for case insensitive comparison. Use -length to limit the comparison to the first len characters. Returns 0 if equal, -1 if str1 sorts before str2, else 1.

string equal ?-nocase? str1 str2

Compares strings and returns 1 if they are the same. Use -nocase for case insensitive comparison.

string first subString string ?startIndex?

Returns the index in string of the first occurrence of subString, or -1 if subString is not found. startIndex may be specified to start in the middle of string.

string index string index

Returns the character at the specified index. An index counts from zero. Use end for the last character.

string is class ?-strict? ?-failindex varname? string

Returns 1 if string belongs to class. If -strict, then empty strings never match, otherwise they always match. If -failindex is specified, then varname is assigned the index of the character in string that prevented it from being a member of class. See Table 4-3 on page 54 for character class names.

string last subString string ?startIndex?

Returns the index in string of the last occurrence of subString, or -1 if subString is not found. startIndex may be specified to start in the middle of string.

string length string

Returns the number of characters in string.

string map ?-nocase? charMap string

Returns a new string created by mapping characters in string according to the input, output list in charMap. See page 55.

string match ?-nocase? pattern str

Returns 1 if str matches the pattern, else 0. Glob-style matching is used. See page 53.

string range str i j

Returns the range of characters in str from i to j.

string repeat str count

Returns str repeated count times.

string replace str first last ?newstr?

Returns a new string created by replacing characters first through last with newstr, or nothing.

string tolower string ?first? ?last?

Returns string in lower case. first and last determine the range of string on which to operate.

string totitle string ?first? ?last?

Capitalizes string by replacing its first character with the Unicode title case, or upper case, and the rest with lower case. first and last determine the range of string on which to operate.

string toupper string ?first? ?last?

Returns string in upper case. first and last determine the range of string on which to operate.

string trim string ?chars?

Trims the characters in chars from both ends of string. chars defaults to whitespace.

string trimleft string ?chars?

Trims the characters in chars from the beginning of string. chars defaults to whitespace.

string trimright string ?chars?

Trims the characters in chars from the end of string. chars defaults to whitespace.

string wordend str ix

Returns the index in str of the character after the word containing the character at index ix.

string wordstart str ix

Returns the index in str of the first character in the word containing the character at index ix.

These are the string operations I use most:

  • The equal operation, which is shown in Example 4-2 on page 53.

  • String match. This pattern matching operation is described on page 53.

  • The tolower, totitle, and toupper operations convert case.

  • The trim, trimright, and trimleft operations are handy for cleaning up strings.

These new operations were added in Tcl 8.1 (actually, they first appeared in the 8.1.1 patch release):

  • The equal operation, which is simpler than using string compare.

  • The is operation, which tests for kinds of strings. String classes are listed in Table 4-3 on page 54.

  • The map operation that translates characters (e.g., like the Unix tr command.)

  • The repeat and replace operations.

  • The totitle operation, which is handy for capitalizing words.
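The operations in this list can be tried out with a short script. The following is only a sketch; the variable names and sample values are made up for illustration:

```tcl
# A short tour of the operations added in Tcl 8.1.
set word "tCL"

# totitle: first character to title case, the rest to lower case.
set title [string totitle $word]                     ;# Tcl

# equal: returns 1 or 0, optionally ignoring case.
set same [string equal -nocase $word tcl]            ;# 1

# is: classify a string; -strict makes the empty string fail.
set isInt [string is integer -strict 42]             ;# 1
set isEmptyInt [string is integer -strict ""]        ;# 0

# map: replace each input sequence with its output.
set mapped [string map {a 1 b 2} abcab]              ;# 12c12

# repeat and replace.
set rule [string repeat - 10]                        ;# ----------
set fixed [string replace "Hello, World" 7 11 Tcl]   ;# Hello, Tcl

puts "$title $same $isInt $isEmptyInt $mapped $rule $fixed"
```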

String Indices

Several of the string operations involve string indices that are positions within a string. Tcl counts characters in strings starting with zero. The special index end is used to specify the last character in a string:

string range abcd 2 end
=> cd

Tcl 8.1 added syntax for specifying an index relative to the end. Specify end-N to get the Nth character before the end. For example, the following command returns a new string that drops the first and last characters from the original:

string range $string 1 end-1

There are several operations that pick apart strings: first, last, wordstart, wordend, index, and range. If you find yourself using combinations of these operations to pick apart data, it may be faster if you can do it with the regular expression pattern matcher described in Chapter 11.
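As a rough illustration of how these operations combine, the sketch below picks one word out of a sample string. The string and the variable names are invented for the example:

```tcl
set s "practical programming in tcl"

set firstChar  [string index $s 0]        ;# p
set lastChar   [string index $s end]      ;# l
set nextToLast [string index $s end-1]    ;# c

# Locate a word with first, then use wordend to find where it stops.
set start [string first programming $s]   ;# 10
set stop  [string wordend $s $start]      ;# 21, one past the last letter
set word  [string range $s $start [expr {$stop - 1}]]   ;# programming

# last finds the final occurrence of a substring. Here it finds the
# separate word "in", not the "in" inside "programming".
set lastIn [string last in $s]            ;# 22
```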

Strings and Expressions

Strings can be compared with expr, if, and while using the comparison operators eq, ne, ==, !=, < and >. However, there are a number of subtle issues that can cause problems. First, you must quote the string value so that the expression parser can identify it as a string type. Then, you must group the expression with curly braces to prevent the double quotes from being stripped off by the main interpreter:

if {$x == "foo"} command

Note

expr is only reliable for string comparison when using eq or ne.

Despite the quotes, the expression operators that work on both numbers and strings first try converting their operands to numbers if possible, and then convert them back if a string comparison turns out to be needed. The conversion back is always done as a decimal number. This can lead to unexpected conversions between strings that look like hexadecimal or octal numbers. The following boolean expression is true!

if {"0xa" == "10"} { puts stdout ack! }
=> ack!

A safe way to compare strings is to use the string compare and string equal operations. The eq and ne expr operators were introduced in 8.4 to allow more compact strict string comparison. These operations also work faster because the unnecessary conversions are eliminated. Like the C library strcmp function, string compare returns 0 if the strings are equal, minus 1 if the first string is lexicographically less than the second, or 1 if the first string is greater than the second:

Example 4-1. Comparing strings with string compare

if {[string compare $s1 $s2] == 0} {
   # strings are equal
}

The string equal command added in Tcl 8.1 makes this simpler:

Example 4-2. Comparing strings with string equal

if {[string equal $s1 $s2]} {
   # strings are equal
}

The eq operator added in Tcl 8.4 is semantically equivalent, but more compact. It also avoids any internal format conversions. There is also a ne operator to efficiently test for inequality.

Example 4-3. Comparing strings with eq

if {$s1 eq $s2} {
   # strings are equal
}
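The difference between the numeric and string comparisons can be seen side by side. This sketch assumes Tcl 8.4 or later, because it uses eq; the variable names are illustrative only:

```tcl
# == converts both operands to numbers when it can; eq never does.
set a "0xa"
set b "10"

set numericEqual [expr {$a == $b}]      ;# 1: both parse as the number 10
set stringEqual  [expr {$a eq $b}]      ;# 0: the characters differ
set viaCommand   [string equal $a $b]   ;# 0

# string compare orders strings; -nocase and -length refine the test.
set order  [string compare apple banana]            ;# -1
set prefix [string compare -length 3 tclsh tcltk]   ;# 0
set folded [string equal -nocase TCL tcl]           ;# 1
```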

String Matching

The string match command implements glob-style pattern matching that is modeled after the file name pattern matching done by various UNIX shells. The heritage of the word "glob" is rooted in UNIX, and Tcl preserves this historical oddity in the glob command that does pattern matching on file names. The glob command is described on page 122. Table 4-2 shows the three constructs used in string match patterns:

Table 4-2. Matching characters used with string match

*

Match any number of any characters.

?

Match exactly one character.

[chars]

Match any character in chars.

Any other characters in a pattern are taken as literals that must match the input exactly. The following example matches all strings that begin with a:

string match a* alpha
=> 1

To match all two-letter strings:

string match ?? XY
=> 1

To match all strings that begin with either a or b:

string match {[ab]*} cello
=> 0

Be careful! Square brackets are also special to the Tcl interpreter, so you will need to wrap the pattern up in curly braces to prevent it from being interpreted as a nested command. Another approach is to put the pattern into a variable:

set pat {[ab]*x}
string match $pat box
=> 1

You can specify a range of characters with the syntax [x-y]. For example, [a-z] represents the set of all lower-case letters, and [0-9] represents all the digits. You can include more than one range in a set. Any letter, digit, or the underscore is matched with:

string match {[a-zA-Z0-9_]} $char

The set matches only a single character. To match more complicated patterns, like one or more characters from a set, you need to use regular expression matching, which is described on page 158.

If you need to include a literal *, ?, or bracket in your pattern, preface it with a backslash:

string match {*\?} what?
=> 1

In this case the pattern is quoted with curly braces because the Tcl interpreter is also doing backslash substitutions. Without the braces, you would have to use two backslashes. They are replaced with a single backslash by Tcl before string match is called.

string match *\\? what?
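A common use of string match is to pick the matching elements out of a list. The sketch below shows one way to do that; GlobFilter is a hypothetical helper written for this example, not a built-in command, and the file names are invented:

```tcl
# Return the elements of list that match a glob pattern.
proc GlobFilter {pattern list} {
    set result {}
    foreach item $list {
        if {[string match $pattern $item]} {
            lappend result $item
        }
    }
    return $result
}

set files {main.c util.c util.h README what? x1 x22}

# Every pattern is braced so that Tcl leaves the brackets and
# backslashes for string match to interpret.
set cFiles   [GlobFilter {*.c} $files]        ;# main.c util.c
set utils    [GlobFilter {util.[ch]} $files]  ;# util.c util.h
set oneDigit [GlobFilter {x[0-9]} $files]     ;# x1
set question [GlobFilter {*\?} $files]        ;# what?
```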

Character Classes

The string is command tests a string to see whether it belongs to a particular class. This is useful for input validation. For example, to make sure something is a number, you do:

if {![string is integer -strict $input]} {
    error "Invalid input. Please enter a number."
}

Classes are defined in terms of the Unicode character set, which means they are more general than specifying character sets with ranges over the ASCII encoding. For example, alpha includes many characters outside the range of [A-Za-z] because of different characters in other alphabets. The classes are listed in Table 4-3.

Table 4-3. Character class names

alnum

Any alphabet or digit character.

alpha

Any alphabet character.

ascii

Any character with a 7-bit character code (i.e., less than 128.)

boolean

A valid Tcl boolean value, such as 0, 1, true, false (in any case).

control

Character code less than 32, and not NULL.

digit

Any digit character.

double

A valid floating point number.

false

A valid Tcl boolean false value, such as 0 or false (in any case).

graph

Any printing characters, not including space characters.

integer

A valid integer.

lower

A string in all lower case.

print

Any printing character, including space characters.

punct

Any punctuation character.

space

Space, tab, newline, carriage return, vertical tab, form feed.

true

A valid Tcl boolean true value, such as 1 or true (in any case).

upper

A string all in upper case.

wordchar

Alphabet, digit, and the underscore.

xdigit

Valid hexadecimal digits.
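The -strict and -failindex options are easiest to understand by example. The following sketch assumes Tcl 8.x behavior for empty strings, as described in Table 4-1; CheckField is a hypothetical helper invented for this illustration:

```tcl
# Validate a field and report where it first goes wrong.
proc CheckField {class value} {
    if {[string is $class -strict -failindex bad $value]} {
        return "ok"
    }
    return "not $class at index $bad"
}

set r1 [CheckField integer 1234]   ;# ok
set r2 [CheckField digit 12x4]     ;# not digit at index 2
set r3 [CheckField alpha Tcl8]     ;# not alpha at index 3

# Without -strict the empty string belongs to every class.
set emptyLoose  [string is digit ""]           ;# 1
set emptyStrict [string is digit -strict ""]   ;# 0
```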

Mapping Strings

The string map command translates a string based on a character map. The map is in the form of an input, output list. Wherever a string contains an input sequence, that is replaced with the corresponding output. For example:

string map {f p d l} food
=> pool

The inputs and outputs can be more than one character and they do not have to be the same length:

string map {f p d ll oo u} food
=> pull
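Two properties of string map are worth seeing in action: replaced text is not scanned again, and at each position the first input in the list that matches is the one used. The sketch below illustrates both; HtmlEscape is a hypothetical helper name chosen for this example:

```tcl
# Escape the characters that are special in HTML. The & produced by
# one replacement is not mapped again, so a single pass is enough.
proc HtmlEscape {text} {
    return [string map {& &amp; < &lt; > &gt; \" &quot;} $text]
}

set escaped [HtmlEscape {if (a < b && c > "d")}]
# escaped is: if (a &lt; b &amp;&amp; c &gt; &quot;d&quot;)

# Inputs are tried in list order at each position in the string.
set ordered [string map {abc 1 ab 2 a 3} abcaba]   ;# 123
```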

Example 4-4 is more practical. It uses string map to replace the fancy quotes and hyphens produced by Microsoft Word with ASCII equivalents. It uses the open, read, and close file operations that are described in Chapter 9, and the fconfigure command described on page 234 to ensure that the file format is UNIX friendly.

Example 4-4. Mapping Microsoft Word special characters to ASCII

proc Dos2Unix {filename} {
   set input [open $filename]
   set output [open $filename.new w]
   fconfigure $output -translation lf
   # The inputs are octal character codes. The quote outputs are
   # escaped so that the map still parses as a list.
   puts $output [string map {
      \223   \"
      \224   \"
      \222   '
      \226   -
   } [read $input]]
   close $input
   close $output
}

The append Command

The append command takes a variable name as its first argument and concatenates its remaining arguments onto the current value of the named variable. The variable is created if it does not already exist:

set foo z
append foo a b c
set foo
=> zabc

Note

The append command is efficient with large strings.

The append command provides an efficient way to add items to the end of a string. It modifies a variable directly, so it can exploit the memory allocation scheme used internally by Tcl. Using the append command like this:

append x " some new stuff"

is always faster than this:

set x "$x some new stuff"

The lappend command described on page 65 has similar performance benefits when working with Tcl lists.
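A typical use of append is building up a result inside a loop. The following sketch uses made-up field names, and it assumes Tcl 8.4 or later for the ne operator and unset -nocomplain:

```tcl
# Build a comma-separated line piece by piece with append.
set line ""
foreach field {alpha beta gamma} {
    if {$line ne ""} {
        append line ", "
    }
    append line $field
}
# line is: alpha, beta, gamma

# append creates the variable if needed and accepts several values.
unset -nocomplain banner
append banner "<" $line ">"
# banner is: <alpha, beta, gamma>
```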

The format Command

The format command is similar to the C printf function. It formats a string according to a format specification:

format spec value1 value2 ...

The spec argument includes literals and keywords. The literals are placed in the result as is, while each keyword indicates how to format the corresponding argument. The keywords are introduced with a percent sign, %, followed by zero or more modifiers, and terminate with a conversion specifier. The most general keyword specification for each argument contains up to six parts:

  • position specifier

  • flags

  • field width

  • precision

  • word length

  • conversion character

Example keywords include %f for floating point, %d for integer, and %s for string format. Use %% to obtain a single percent character. The following examples use double quotes around the format specification. This is because the format often contains white space, so grouping is required, as well as backslash substitutions like \t or \n, and the quotes allow substitution of these special characters. Table 4-4 lists the conversion characters:

Table 4-4. Format conversions

d

Signed integer.

u

Unsigned integer.

i

Signed integer. The argument may be in hex (0x) or octal (0) format.

o

Unsigned octal.

x or X

Unsigned hexadecimal. 'x' gives lowercase results.

c

Map from an integer to the ASCII character it represents.

s

A string.

f

Floating point number in the format a.b.

e or E

Floating point number in scientific notation, a.bE+-c.

g or G

Floating point number in either %f or %e format, whichever is shorter.

A position specifier is i$, which means take the value from argument i as opposed to the normally corresponding argument. The position counts from 1. If a position is specified for one format keyword, the position must be used for all of them. If you group the format specification with double quotes, you need to quote the $ with a backslash:

set lang 2
format "%${lang}\$s" one un uno
=> un

The position specifier is useful for picking a string from a set, such as this simple language-specific example. The message catalog facility described in Chapter 15 is a much more sophisticated way to solve this problem. The position is also useful if the same value is repeated in the formatted string.
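Repeating and reordering values can be sketched as follows. The format strings are grouped with braces here, an alternative to the backslash, so the $ reaches format untouched. The sample values are invented:

```tcl
# The same argument can be used more than once.
set msg [format {%2$s, %1$s! %2$s again.} world Hello]
# msg is: Hello, world! Hello again.

# Arguments can be consumed in any order, and a position specifier
# can be combined with flags and a field width.
set date [format {%3$04d-%2$02d-%1$02d} 7 3 2003]
# date is: 2003-03-07
```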

The flags in a format are used to specify padding and justification. In the following examples, the # causes a leading 0x to be printed in the hexadecimal value. The zero in 08 causes the field to be padded with zeros. Table 4-5 summarizes the format flag characters.

format "%#x" 20
=> 0x14
format "%#08x" 10
=> 0x0000000a

Table 4-5. Format flags

-

Left justify the field.

+

Always include a sign, either + or -.

space

Precede a number with a space, unless the number has a leading sign. Useful for packing numbers close together.

0

Pad with zeros.

#

Leading 0 for octal. Leading 0x for hex. Always include a decimal point in floating point. Do not remove trailing zeros (%g).

After the flags you can specify a minimum field width value. The value is padded to this width with spaces, or with zeros if the 0 flag is used:

format "%-20s %3d" Label 2
=> Label                  2

You can compute a field width and pass it to format as one of the arguments by using * as the field width specifier. In this case the next argument is used as the field width instead of the value, and the argument after that is the value that gets formatted.

set maxl 8
format "%-*s = %s" $maxl Key Value
=> Key      = Value
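A computed field width is handy for lining up a report when the longest label is not known in advance. This is a minimal sketch; the names and values in pairs are placeholders:

```tcl
# Print name/value pairs with the names padded to the longest one.
set pairs {host example.com port 8080 username brent}

# First pass: find the longest name.
set width 0
foreach {name value} $pairs {
    if {[string length $name] > $width} {
        set width [string length $name]
    }
}

# Second pass: * takes the field width from the argument list.
set report ""
foreach {name value} $pairs {
    append report [format "%-*s = %s\n" $width $name $value]
}
puts -nonewline $report
```

Each name is padded to eight characters, the length of username, so the equal signs line up in one column.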

The precision comes next, and it is specified with a period and a number. For %f and %e it indicates how many digits come after the decimal point. For %g it indicates the total number of significant digits used. For %d and %x it indicates how many digits will be printed, padding with zeros if necessary.

format "%6.2f %6.2d" 1 1
=>   1.00     01
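A few more precision examples may help. This sketch uses an arbitrary sample value; the last line relies on the fact, not mentioned above, that a precision on %s limits the number of characters printed:

```tcl
set pi 3.14159265

set fixed   [format "%.3f" $pi]      ;# 3.142 (three digits after the point)
set general [format "%.3g" $pi]      ;# 3.14  (three significant digits)
set hex     [format "%.4x" 255]      ;# 00ff  (at least four digits)
set clipped [format "%.3s" abcdef]   ;# abc   (at most three characters)
```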

The word length part comes last, but it only became useful in Tcl 8.4, where wide integer support was added. Otherwise Tcl maintains all floating point values in double precision, and all integers as long words. Wide integers are a minimum of 64 bits wide. By adding the l (long) word length specifier, we can see the difference between regular and wide integers.

format %u -1
=> 4294967295
format %lu -1
=> 18446744073709551615

The scan Command

The scan command parses a string according to a format specification and assigns values to variables. It returns the number of successful conversions it made, unless no capture variables are given, in which case it returns the scan matches in a list. The general form of the command is:

scan string format ?var? ?var? ?var? ...

The format for scan is nearly the same as in the format command. The %c scan format converts one character to its decimal value.

The scan format includes a set notation. Use square brackets to delimit a set of characters. The set matches one or more characters that are copied into the variable. A dash is used to specify a range. The following scans a field of all lowercase letters.

scan abcABC {%[a-z]} result
=> 1
set result
=> abc

If the first character in the set is a right square bracket, then it is considered part of the set. If the first character in the set is ^, then characters not in the set match. Again, put a right square bracket immediately after the ^ to include it in the set. Nothing special is required to include a left square bracket in the set. As in the previous example, you will want to protect the format with braces, or use backslashes, because square brackets are special to the Tcl parser.
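The following sketch pulls these pieces together: several conversions in one format, the list-returning form with no variables, and a negated set. The input strings are invented for illustration:

```tcl
# Parse a clock-style string. %d reads decimal digits, so 08 is eight.
set n [scan "08:05:30" "%d:%d:%d" hours minutes seconds]
# n is 3, hours is 8, minutes is 5, seconds is 30

# With no variables, scan returns the converted values as a list.
set parts [scan "192.168.1.20" "%d.%d.%d.%d"]
# parts is: 192 168 1 20

# A ^ set matches everything up to a delimiter, here the colon.
scan "root:x:0:0" {%[^:]:%[^:]:%d} user pass uid
# user is root, pass is x, uid is 0

# %c converts a single character to its character code.
scan A %c code
# code is 65
```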

The binary Command

Tcl 8.0 added support for binary strings. Previous versions of Tcl used null-terminated strings internally, which foils the manipulation of some types of data. Tcl now uses counted strings, so it can tolerate a null byte in a string value without truncating it.

This section describes the binary command that provides conversions between strings and packed binary data representations. The binary format command takes values and packs them according to a template. For example, this can be used to format a floating point vector in memory suitable for passing to Fortran. The resulting binary value is returned:

binary format template value ?value ...?

The binary scan command extracts values from a binary string according to a similar template. For example, this is useful for extracting data stored in a binary data file. It assigns values to a set of Tcl variables:

binary scan value template variable ?variable ...?

Format Templates

The format template consists of type keys and counts. The count is interpreted differently depending on the type. For types like integer (i) and double (d), the count is a repetition count (e.g., i3 means three integers). For strings, the count is a length (e.g., a3 means a three-character string). If no count is specified, it defaults to 1. If count is *, then binary scan uses all the remaining bytes in the value.

Several type keys can be specified in a template. Each key-count combination moves an imaginary cursor through the binary data. There are special type keys to move the cursor. The x key generates null bytes in binary format, and it skips over bytes in binary scan. The @ key uses its count as an absolute byte offset to which to set the cursor. As a special case, @* skips to the end of the data. The X key backs up count bytes. The types are summarized in Table 4-6. In the table, count is the optional count following the type letter.

Table 4-6. Binary conversion types

a

A character string of length count. Padded with nulls in binary format.

A

A character string of length count. Padded with spaces in binary format. Trailing nulls and blanks are discarded in binary scan.

b

A binary string of length count. Low-to-high order.

B

A binary string of length count. High-to-low order.

h

A hexadecimal string of length count. Low-to-high order.

H

A hexadecimal string of length count. High-to-low order. (More commonly used than h.)

c

An 8-bit character code. The count is for repetition.

s

A 16-bit integer in little-endian byte order. The count is for repetition.

S

A 16-bit integer in big-endian byte order. The count is for repetition.

i

A 32-bit integer in little-endian byte order. The count is for repetition.

I

A 32-bit integer in big-endian byte order. The count is for repetition.

f

Single-precision floating point value in native format. The count is for repetition.

d

Double-precision floating point value in native format. The count is for repetition.

w

A 64-bit integer in little-endian byte order. The count is for repetition. (Tcl 8.4)

W

A 64-bit integer in big-endian byte order. The count is for repetition. (Tcl 8.4)

x

Pack count null bytes with binary format. Skip count bytes with binary scan.

X

Back up count bytes.

@

Skip to absolute position specified by count. If count is *, skip to the end.

Numeric types have a particular byte order that determines how their value is laid out in memory. The type keys are lowercase for little-endian byte order (e.g., Intel) and uppercase for big-endian byte order (e.g., SPARC and Motorola). Different integer sizes are 16-bit (s or S), 32-bit (i or I), and, with Tcl 8.4 or greater, 64-bit (w or W). Note that the official byte order for data transmitted over a network is big-endian. Floating point values are always machine-specific, so it only makes sense to format and scan these values on the same machine.

There are three string types: character (a or A), binary (b or B), and hexadecimal (h or H). With these types the count is the length of the string. The a type pads its value to the specified length with null bytes in binary format and the A type pads its value with spaces. If the value is too long, it is truncated. In binary scan, the A type strips trailing blanks and nulls.

A binary string consists of zeros and ones. The b type specifies bits from low-to-high order, and the B type specifies bits from high-to-low order. A hexadecimal string specifies 4 bits (i.e., nybbles) with each character. The h type specifies nybbles from low-to-high order, and the H type specifies nybbles from high-to-low order. The B and H formats match the way you normally write out numbers.
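One way to see the effect of byte, bit, and nybble order is to pack a small value and dump it back out. The values below were chosen only because their layout is easy to read:

```tcl
# Pack the 16-bit value 258 (0x0102) in both byte orders, then dump
# the bytes as hex with the H type to see how they are laid out.
binary scan [binary format S 258] H* bigEndian      ;# 0102
binary scan [binary format s 258] H* littleEndian   ;# 0201

# b and B list the bits of a byte in opposite orders.
binary scan [binary format c 5] B8 highToLow        ;# 00000101
binary scan [binary format c 5] b8 lowToHigh        ;# 10100000

# h and H do the same for the two nybbles of a byte.
binary scan [binary format c 0x1f] H2 hexHigh       ;# 1f
binary scan [binary format c 0x1f] h2 hexLow        ;# f1
```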

Examples

When you experiment with binary format and binary scan, remember that Tcl treats things as strings by default. A "6", for example, is the character 6 with character code 54 or 0x36. The c type returns these character codes:

set input 6
binary scan $input "c" 6val
set 6val
=> 54

You can scan several character codes at a time:

binary scan abc "c3" list
=> 1
set list
=> 97 98 99

The previous example uses a single type key, so binary scan sets one corresponding Tcl variable. If you want each character code in a separate variable, use separate type keys:

binary scan abc "ccc" x y z
=> 3
set z
=> 99

Use the H format to get hexadecimal values:

binary scan 6 "H2" 6val
set 6val
=> 36

Use the a and A formats to extract fixed width fields. Here the * count is used to get all the rest of the string. Note that A trims trailing spaces:

binary scan "hello world " a3x2A* first second
puts "\"$first\" \"$second\""
=> "hel" " world"

Use the @ key to seek to a particular offset in a value. The following command gets the second double-precision number from a vector. Assume the vector is read from a binary data file:

binary scan $vector "@8d" double

With binary format, the a and A types create fixed width fields. A pads its field with spaces, if necessary. The value is truncated if the string is too long:

binary format "A9A3" hello world
=> hello    wor

An array of floating point values can be created with this command:

binary format "f*" {1.2 3.45 7.43 -45.67 1.03e4}

Remember that floating point values are always in native format, so you have to read them on the same type of machine on which they were created. With integer data you specify either big-endian or little-endian formats. The tcl_platform variable described on page 193 can tell you the byte order of the current platform.
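The sketch below round-trips a few integers in big-endian order and then chooses a type key to match the current platform. It assumes the byteOrder element of tcl_platform, which holds littleEndian or bigEndian; the variable names are illustrative:

```tcl
# Round-trip three 32-bit integers in network (big-endian) order.
# With a count, the values are passed as a single list argument.
set packet [binary format I3 {1 256 65536}]
set size [string length $packet]     ;# 12 bytes
binary scan $packet I3 values        ;# values is: 1 256 65536

# Pick the 32-bit type key that matches this machine.
if {$::tcl_platform(byteOrder) eq "littleEndian"} {
    set nativeInt i
} else {
    set nativeInt I
}
set native [binary format $nativeInt 1]
```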

Binary Data and File I/O

When working with binary data in files, you need to turn off the newline translations and character set encoding that Tcl performs automatically. These are described in more detail on pages 120 and 219. For example, if you are generating binary data, the following command puts your standard output in binary mode:

fconfigure stdout -translation binary -encoding binary
puts [binary format "B8" 11001010]
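The same settings apply to ordinary files. As a minimal sketch, the following writes two doubles to a file and reads them back; the file name vector.bin is a placeholder, and the script assumes the current directory is writable:

```tcl
set path [file join [pwd] vector.bin]

# Write: binary translation, and no trailing newline from puts.
set out [open $path w]
fconfigure $out -translation binary
puts -nonewline $out [binary format d2 {1.5 -2.25}]
close $out

# Read the bytes back with the same channel settings.
set in [open $path r]
fconfigure $in -translation binary
set data [read $in]
close $in

binary scan $data d2 values     ;# values is: 1.5 -2.25
file delete $path
```

Setting -translation binary on a channel also disables character set encoding, so a separate -encoding binary option is not needed here.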

Related Chapters

  • To learn more about manipulating data in Tcl, read about lists in Chapter 5 and arrays in Chapter 8.

  • For more about pattern matching, read about regular expressions in Chapter 11.

  • For more about file I/O, see Chapter 9.

  • For information on Unicode and other Internationalization issues, see Chapter 15.
