In this chapter, I will describe the limits of patterns, including what to do when you hit them. I will also cover the darker side of range patterns—matching anything but certain characters. In the second half of the chapter, I will go into more detail on pattern actions including flow control. Finally, I will cover some miscellaneous pattern matching issues that do not fit anywhere else.
As I said in Chapter 5 (p. 113), pattern pieces match as many characters as possible. This makes it a little tricky to match a single line, single word, or single anything. For example, the regular expression ".*\n" matches a single line, but it also matches two lines because two lines end with a "\n". Similarly, it matches three lines, four lines, and so on. If you want to read lines one at a time from another program, then you cannot use this kind of pattern. The solution is to use the "^".
In Chapter 3 (p. 73), I showed that the "^" matches the beginning of the input buffer. When ^ is the first character of a regular-expression range, it means match anything but the given characters. For example, the regular expression [^ab] matches any character except a or b. The pattern [^a-zA-Z] matches any character but a letter.[26]
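The two ranges above can be tried without a spawned process by handing them to Tcl's standalone regexp command. This is only a minimal sketch; the sample string is made up for illustration.

```tcl
# Negated ranges tried out with Tcl's standalone regexp command.
# The backslash keeps Tcl from treating "[" as the start of a command.
set sample "abc-def"   ;# hypothetical input

regexp "\[^ab]" $sample first          ;# first character that is neither a nor b
regexp "\[^a-zA-Z]" $sample nonletter  ;# first character that is not a letter

puts "first: $first"
puts "nonletter: $nonletter"
```

The first command skips over "ab" and stops at the c; the second skips every letter and stops at the hyphen.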
A range can be used to build larger patterns. The pattern "[^ ]*" matches the longest string not including a blank. For example, if the input buffer contained a sentence such as "the cow jumped over the moon", the following expect command could be called repeatedly to match each word in the input.
expect -re "(\[^ ]*) "
The range matches each word and the result is stored in $expect_out(1,string). The space at the end of the word is matched explicitly. Without the explicit space, the input buffer is left beginning with a space (" cow jumped ...") and subsequent matches return the null string before the first space.
Remember that the length of the match is important, but only after the starting position is taken into account. Patterns match the longest string at the first possible position in the input. In this example, 0 characters at column 0 are successfully matched even though the pattern can also match 3 characters at column 1. Because column 0 is before column 1, the earlier match is used.
There is no explicit way to prefer a later match over an earlier one, but often it is possible to simply pick a more descriptive pattern. In this example, the space can be skipped over. Alternatively, the * can be replaced by a + to force the pattern to match at least one character. This effectively skips over the space between the words without the need to explicitly match it.
expect -re "\[^ ]+"
Now the word is stored in "expect_out(0,string)". Because the pattern does not match whitespace, there is no need to select pieces of it, and the parentheses are no longer needed, simplifying the pattern further.
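The difference between the * and + forms can be seen with Tcl's standalone regexp command, which follows the same earliest-match-first rule. The sketch below assumes a leftover buffer that begins with a space; the variable names are only illustrative.

```tcl
# What is left in the buffer after a word has been consumed without its space.
set buf " cow jumped"

regexp "\[^ ]*" $buf star   ;# succeeds at column 0 with zero characters
regexp "\[^ ]+" $buf plus   ;# needs one character, so it moves on to column 1

puts "star: '$star' (length [string length $star])"
puts "plus: '$plus'"
```

Both commands report a match, but the first one matches the empty string while the second one picks up the next word.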
Here is the opening greeting from Uunet's SMTP server. SMTP is the mail protocol used by most Internet computers. The server is normally controlled by a mail program to transfer mail from one host to another, but you can telnet to it directly and type commands interactively. The telnet program adds the first three lines, and Uunet sends back the line that begins "220":
% telnet relay1.uu.net smtp
Trying 192.48.96.5 ...
Connected to relay1.uu.net.
Escape character is `^]'.
220 relay1.UU.NET Sendmail 5.61/UUNET-internet-primary ready at Mon, 22 Feb 93
23:13:56 -0500
In the last line (which wraps over two physical lines on the page), the remote hostname appears immediately after the "220". In order to match and extract the hostname, use the following command:
expect -re "\n220 (\[^ ]+) "
There are several subtle things about this command. First of all, the SMTP protocol dictates that responses are terminated by \r\n and that the initial response to the connection begins with the string 220 followed by the host or domain identification. Thus, you are guaranteed to see the string "220".
Unfortunately, the telnet program prints out the IP address of the remote host in its "Trying ..." message. Since it is quite possible for part of the IP address to actually be "220", the pattern starts with \n to match the end of the previous line, effectively forcing the 220 to be the first thing on its own line. A space is skipped and then the familiar "[^ ]+" pattern matches the hostname.
Unlike the previous example, yet another space follows the "[^ ]+" pattern. Since the pattern explicitly forces the hostname to be non-null, why is a space needed at the end of the name? As I described in Chapter 4 (p. 89), network or other delays might crop up at any time. For example, if the greeting line had only partially printed by the time the pattern matching had begun, the input buffer might contain just "220 rela". Without the explicit space after the hostname, the pattern would match "rela". With the explicit space, the pattern will match the full "relay1.UU.NET".
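The effect of the trailing space can be checked offline by applying the same pattern with Tcl's regexp command to strings that stand in for the input buffer. This is a sketch under that assumption; the buffer contents are abbreviated from the dialogue above and the variable names are hypothetical.

```tcl
# A simulated input buffer: telnet's chatter followed by the SMTP greeting.
set buf "Trying 192.48.96.5 ...\r\nConnected to relay1.uu.net.\r\n"
append buf "220 relay1.UU.NET Sendmail 5.61/UUNET-internet-primary ready\r\n"

set pat "\n220 (\[^ ]+) "
regexp $pat $buf ignore host
puts "host: $host"

# The same greeting, caught when only part of it has arrived.
set partial "Connected to relay1.uu.net.\r\n220 rela"
set complete [regexp $pat $partial]          ;# no trailing space yet, so no match
regexp "\n220 (\[^ ]+)" $partial ignore short ;# pattern without the trailing space
puts "pattern with space matched partial input: $complete"
puts "pattern without space extracted: $short"
```

With the trailing space, the partial buffer simply fails to match, which in an expect command means waiting for more input rather than accepting a truncated name.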
Matching the hostname from the SMTP dialogue is not an artificial example. This technique can be used to convert IP addresses to hostnames when the IP-to-hostname directory entries do not exist, a common fault in many domains. In practice, the likelihood of a host running an SMTP server is much higher than the likelihood that its domain name server is correctly configured with complete reverse mappings. The gethostbyname example script that comes with the Expect distribution resorts to this and a number of other techniques to convert host addresses to names.
The ability to automate telnet opens up worlds of possibilities. All sorts of useful data can be collected and manipulated through interfaces that were originally designed only for humans.
Much of this information is rather open-ended. There may be no standards describing it other than a particular implementation. However, by studying sufficient output, you can usually come up with Expect scripts to read it back in. And if you cannot write an Expect script to understand a program’s output, chances are that humans cannot understand the output to begin with.
Writing scripts to understand natural language is not particularly difficult, but Expect does not give any particular assistance for the task. Regular expressions by themselves are certainly not sufficient to describe arbitrarily complex patterns. In some situations, it is even reasonable to avoid using complex patterns and instead match input algorithmically, using Tcl commands.
Take the case of automating ftp. In Chapter 3 (p. 83), I showed that it was very easy to retrieve a file if the name was known in advance, either by the script or the user. If the name is not known, it is harder. For example, ftp does not support directory retrieval. This can be simulated by retrieving every file in the directory individually. (You can automate this to some degree using ftp's built-in wildcards, but that does not handle subdirectories so it is not a complete solution and I will ignore it for now.)
Further, imagine that you want to retrieve only files created after a certain date. This requires looking at a “long” directory listing. As an example, here is a listing of the directory /published/usenix on ftp.uu.net.
ftp>cd published/usenix
250 CWD command successful.
ftp>ls -lt
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls.
total 41
drwxrwsr-x 3 3 2 512 Sep 26 14:58 conference
drwxr-sr-x 1695 3 21 39936 Jul 31 1992 faces
lrwxrwxrwx 1 3 21 32 Jul 31 1992 bibliography -> /archive/doc/literary/obi/USENIX
226 Transfer complete.
remote: -lt
245 bytes received in 0.065 seconds (3.7 Kbytes/s)
ftp>
It is easy to pick out the directory listing from this output. As before, you can see the protocol responses that each start with a three-digit number. These can be matched directly, but there is no way of separately matching all of the bits and pieces of information in the directory listing in a single pattern. There is just too much of it. And this is a short directory. Directories can contain arbitrarily many files.
Upon close examination, you can see that the directory lines use different formats. For example, the third file is a symbolic link and shows the link target. The second and third files show modification dates with the year while the first file shows the date with the time. And for a dash of confusion, the formatting is inconsistent—the columns do not line up in the same place from one entry to the next.
One way to deal with all of this is to match the fields in each line, one line at a time in a loop. The command to match a single line might start out like this:
expect -re "d(\[^ ]*) +(\[^ ]*) +(\[^ ]*) +(\[^ ]*) +( ...
The command is incomplete: the pattern does not even fit on the page, and it only describes directories (notice the "d" in the front). You would need similar patterns to match other file types. This complexity might suggest that this is the wrong approach.
An alternative is to use patterns only at a very superficial level. You can match the individual lines initially and then later break up the lines themselves. At this point, matching a single line should be no surprise. It is just a variation on what I have already shown in many different forms:
expect -re "(\[^\r]*)\r\n"
To get the individual pieces out, you can now use any of the plain old Tcl commands. You can treat $expect_out(1,string) as a simple list and index it. For example, to obtain the file name:
lindex $expect_out(1,string) 8
To obtain the month and day:
lrange $expect_out(1,string) 5 6
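The list indices can be checked against one of the lines from the listing above. In this sketch the variable line stands in for $expect_out(1,string), which only exists after a real expect command has matched.

```tcl
# One line of the "ls -lt" output, standing in for $expect_out(1,string).
set line "drwxrwsr-x 3 3 2 512 Sep 26 14:58 conference"

set name [lindex $line 8]      ;# ninth field: the file name
set date [lrange $line 5 6]    ;# sixth and seventh fields: month and day

puts "name: $name"
puts "date: $date"
```

Treating a line as a Tcl list works for listings like this one, but a file name containing braces or quotes could confuse the list parser, so treat the approach as a convenience rather than a general parser.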
Using the following commands, you can get the file's type field (the first character on each line from "ls -l") and then process the files, directories, and symbolic links differently:
set type [string index $expect_out(1,string) 0]
switch -- $type \
    "-" {
        # file
    } "d" {
        # directory
    } "l" {
        # symbolic link
    } default {
        # unknown
    }
With no flags other than "--", the patterns in the switch command are a subset of the glob patterns (everything but ^ and $). This fragment of code actually comes out of the rftp (for "recursive ftp") script that comes with Expect as an example.
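To see the type switch run on its own, it can be wrapped in a small procedure and fed lines from the listing. The procedure name classify and the returned labels are hypothetical, chosen only for this sketch; they are not part of rftp.

```tcl
# Classify one line of "ls -l" output by its first character.
proc classify {line} {
    set type [string index $line 0]
    switch -- $type \
        "-" {
            return "file"
        } "d" {
            return "directory"
        } "l" {
            return "symbolic link"
        } default {
            return "unknown"
        }
}

foreach line {
    "drwxrwsr-x 3 3 2 512 Sep 26 14:58 conference"
    "lrwxrwxrwx 1 3 21 32 Jul 31 1992 bibliography -> /archive/doc/literary/obi/USENIX"
    "total 41"
} {
    puts "[classify $line]: $line"
}
```

The "total 41" line falls through to the default arm, which is why a script reading a real listing needs that arm even if it only skips the line.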
With actions, the whole command to switch on the file type is not much more complicated. There are two of them, one to “get” files and one to “put” files. Below is a procedure to put files. The procedure is called for each file in the directory listing. The first argument is the name of the file, and the second argument is the first character of the type field.
proc putentry {name type} {
    switch -- $type \
        "d" {
            # directory
            if {$name=="." || $name==".."} return
            putdirectory $name
        } "-" {
            # file
            putfile $name
        } "l" {
            # symlink, could be either file or directory
            # first assume it's a directory
            if [putdirectory $name] return
            putfile $name
        } default {
            puts "can't figure out what $name is, skipping\n"
        }
}
For each directory encountered, putdirectory is called, which changes directories both remotely and locally and then recursively lists the new directory, calling putentry again for each line in the list. The files "." (current directory) and ".." (parent directory) are skipped.
Regular files are transferred directly by sending a put command inside putfile. Symbolic links are trickier since they can point either to directories or plain files. There is no direct way to ask, so the script instead finds out by blindly attempting to transfer the link as a directory. Since the attempt starts by sending a "cd" command, the putdirectory procedure fails if the link is not a directory. Upon failure, the script then goes on to transfer it as a plain file. Upon success, the procedure returns.
Occasionally it is useful to prevent the pattern matcher from performing any special interpretation of characters. This can be done using the -ex flag, which causes exact matching. For example, the following command matches only an asterisk.
expect -ex "*"
When using -ex, patterns are always unanchored. The ^ and $ match themselves literally even if they appear as the first or last characters in a pattern. So the following command matches the sequence of characters "^", "*", "\n", and "$". The usual Tcl interpretations apply. Hence, the \n is still interpreted as a single character.
expect -ex "^*\n$" ;# matches ^ * \n $
Consider another example:
expect -ex "\\n"
Tcl interprets the \\n as \n and the exact matching occurs with no further backslash processing. This statement matches a backslash followed by an n.
The results of exact matches are written to expect_out(buffer) and expect_out(0,string) as usual, although expect_out(0,string) is necessarily set to the original pattern.
Using -ex may seem like a way to simplify many patterns, but it is really only useful in special circumstances. Most patterns require either wildcards or anchors. And strings such as "foo" are so simple that they mean the same thing when specified via -gl in the first place. However, -ex is useful when patterns are computer-generated (or user-supplied). For example, suppose you are creating Expect patterns dynamically from a program that is printing out SQL lines such as:
select * from tbl.col where col like 'name?'
To protect this from misinterpretation by -gl or -re, you would have to analyze the string and figure out where to insert backslashes. Instead, it is much simpler to pass the whole thing as a pattern using -ex. The following fragment reads a pattern from a file or process and then waits for it from the spawned process.
set pat [gets $patfile]
expect -ex $pat
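The hazard of letting a pattern matcher interpret such a string can be illustrated without Expect. The sketch below is only an analogy: it uses Tcl's "string first" for a literal search and "string match" for a glob search, and it does not claim to show how the -ex flag is implemented. The two sample output lines are made up.

```tcl
set pat "select * from tbl.col where col like 'name?'"

set same  "sql> select * from tbl.col where col like 'name?'"
set other "sql> select a, b from tbl.col where col like 'names'"

# Literal search: only the identical text is found (-1 means not found).
set exactSame  [string first $pat $same]
set exactOther [string first $pat $other]

# Glob search: * and ? act as wildcards, so the other line matches as well.
set globOther [string match "*$pat*" $other]

puts "literal search in same line:  $exactSame"
puts "literal search in other line: $exactOther"
puts "glob search in other line:    $globOther"
```

The glob search accepts a line that merely resembles the SQL statement, which is the kind of misinterpretation that exact matching avoids.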
If you are hand-entering SQL commands in your Expect scripts, then you have to go a step further and protect the commands from being interpreted by Tcl. You can use braces to do this. Here is an example, combined with the -ex flag.
expect -ex {select * from tbl.col where col like 'name?'}
I show this only to discourage you again from using braces around patterns. While it works in this example, it is not necessary since you can prevent the substitutions at the time you handcode it by adding backslashes appropriately. Chances are that you will want to make variable substitutions in these or else they would have been stored in a file anyway. And if you are using more than a few patterns like these, you probably will not have them embedded in your scripts, so you do not need to worry about the Tcl substitutions in the first place.
Matching a single line is such a common task that it is worth getting very familiar with it. The one-line script on page 131 matches a single line and this same technique will show up in many more scripts so it is worth examining closely here.
Suppose you want to search for a file in the file system with the string "frob" at the beginning of the name. There may be many files named "frob" (well, maybe not). You are just interested to know if there is at least one. The obvious tool to use is find. Unfortunately, find provides no control over the number of files it finds. You cannot tell it to quit after one. Here is an Expect script to do just that:
spawn find . -name "frob*" -print
set timeout -1
expect -re "\[^\r]*\r\n"
The script starts by spawning the find command. The timeout is disabled since this could be a very long running command. The expect pattern waits for one complete line to appear, and then the script exits. This works because the range waits for any character that is not a \r and the * waits for any number of them, that is, any number of characters that are not \r's. The second \r both allows and forces a single \r. Finally the \n matches the linefeed in the carriage-return linefeed sequence. The only thing that can be matched is a single line.
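The contrast between the greedy ".*\n" pattern and the single-line pattern can be seen by running both through Tcl's regexp command on a string that imitates three lines of find output. The file names here are hypothetical.

```tcl
# Three lines as they would arrive from a spawned process (with \r\n endings).
set buf "./a/frob1\r\n./b/frob2\r\n./c/frob3\r\n"

regexp ".*\n" $buf greedy        ;# runs on to the last newline in the buffer
regexp "\[^\r]*\r\n" $buf one    ;# cannot cross a \r, so it stops after one line

puts "greedy matched [string length $greedy] of [string length $buf] characters"
puts "single line: [string trimright $one]"
```

The greedy pattern swallows the whole buffer, while the range that excludes \r can never extend past the end of the first line.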
Without Expect, it is possible to get find to kill itself by saving its process id in a file and then forking the kill command from an -exec clause in the find. However, doing this is fairly painful. And find is just a special case. Many other commands do not have the power of find yet share the same problem of lacking any sophisticated control. For example, grep does not have any way to execute arbitrary commands when it matches. There is no way to tell grep to print only the first match.
For this and other commands, here is an Expect script which I call firstline:
#!/usr/local/bin/expect --
eval spawn $argv
set timeout -1
expect -re "\[^\r]*\r\n"
Immediately after matching the first line of output, firstline exits. Note that if the underlying process produces output quickly enough, the script may actually print several lines of output. That does not mean the pattern is matching multiple lines. It is still only matching one. However, by default Expect prints out everything it sees whether or not it matches.
In Chapter 7 (p. 171), I will describe how to change this default so that you can write scripts that only print out what they match.
Having matched a single line, it is no longer possible to automatically break it up into pieces stored in the array expect_out. Tcl does, however, offer a standalone version of both the regular expression and glob pattern matchers.
Glob pattern matching is explicitly done using the "string match" command. The command follows the format:
string match pattern string
The string replaces the implicit reference to the input buffer in an expect command. The command returns 1 if there is a match or 0 if there is no match. For example:
if [string match "f*b*" "foobar"] {
    puts "match"
} else {
    puts "no match"
}
The switch command (demonstrated on page 131) is a little more like the expect command. It supports multiple patterns and actions, but like "string match", switch uses an explicit string. Neither switch nor "string match" supports the ^ and $ anchors.
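Because "string match" has no anchors, its pattern must account for the entire string, and a leading or trailing * is the way to leave either end open. A short sketch using the same string as above:

```tcl
set whole [string match "f*b*" "foobar"]   ;# pattern covers the whole string
set tail  [string match "b*" "foobar"]     ;# string does not start with b
set any   [string match "*b*" "foobar"]    ;# a b anywhere in the string

puts "f*b* -> $whole"
puts "b*   -> $tail"
puts "*b*  -> $any"
```

The second test fails even though "bar" appears in the string, because nothing in the pattern accounts for the leading "foo".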
Tcl's regexp command matches strings using regular expressions. The regexp command has the same internal pattern matcher that expect uses but the interface is different. The expect command provides the string implicitly while regexp requires that the string be an explicit argument.
The calling syntax is as follows:
regexp pattern string var0 var1 var2 var3 . . .
The first argument is a pattern. The second argument is the string from which to match. The remaining arguments are variables, set to pieces of the string that match the pattern. The variable var0 is set to the substring that was matched by the whole pattern (analogous to expect_out(0,string)). The remaining variables are set to the substrings that matched the parenthesized parts of the pattern (analogous to expect_out(1,string) through expect_out(9,string)).
expect1.1>set addr "usenet@uunet.uu.net"
For example, the following command separates the Internet email address (above) into a user and host:
expect1.2>regexp (.*)@(.*) $addr ignore user host
1
expect1.3>set user
usenet
expect1.4>set host
uunet.uu.net
The first parenthesized pattern matches the user and is stored in the variable user. The @ matches the literal @ in the address, and the remaining parenthesized pattern matches the host. Whatever is matched by the entire pattern is stored in ignore, called this because it is not of interest here. This is analogous to the expect command where expect_out(0,string) is often ignored. The command returns 1 if the pattern matches or 0 if it does not.
The regexp command accepts the optional flag "-indices". When used, regexp stores a list of the starting and ending character positions in each output variable rather than the strings themselves. Here is the previous command with the -indices flag:
expect1.5>regexp -indices (.*)@(.*) $addr ignore user host
1
expect1.6>set user
0 5
expect1.7>set host
7 18
The expect command also supports an "-indices" flag (shown in Chapter 5 (p. 123)) but there are differences between the way expect and regexp support it. The expect command writes the indices into the expect_out array alongside the strings themselves so you do not have to repeat the expect command to get both strings and indices. Also, the elements are written separately so that it is possible to extract the start or ending index without having to break them apart.
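With regexp, getting both the strings and the positions takes two calls, and each index pair has to be taken apart with list commands. The following sketch repeats the address example in script form; the variable names userpos and hostpos are only illustrative.

```tcl
set addr "usenet@uunet.uu.net"

# First call: the matched strings themselves.
regexp "(.*)@(.*)" $addr ignore user host
puts "user: $user, host: $host"

# Second call: start and end positions instead of strings.
regexp -indices "(.*)@(.*)" $addr ignore userpos hostpos
puts "user spans $userpos, host spans $hostpos"

# Each position variable is a two-element list that must be split by hand.
set start [lindex $hostpos 0]
set end   [lindex $hostpos 1]
set again [string range $addr $start $end]
puts "host recovered from indices: $again"
```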
The regsub command makes substitutions in a string that matches a regular expression. For example, the following command substitutes like with love in the value of olddiet. The result is stored in the variable newdiet.
expect1.1>set olddiet "I like cheesecake!"
I like cheesecake!
expect1.2>regsub "like" $olddiet "love" newdiet
1
expect1.3>set newdiet
I love cheesecake!
If the expression does not match, no substitution is made and regsub returns 0 instead of 1. However, the string is still copied to the variable named by the last parameter.
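Both outcomes can be seen side by side in a short script. The second pattern, "hate", is a made-up example of an expression that does not occur in the string.

```tcl
set olddiet "I like cheesecake!"

# The expression matches: one substitution is made.
set hit [regsub "like" $olddiet "love" newdiet]
puts "count=$hit result=$newdiet"

# The expression does not match: nothing is substituted,
# but the original string is still copied to the output variable.
set miss [regsub "hate" $olddiet "love" unchanged]
puts "count=$miss result=$unchanged"
```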
Strings that match parenthesized expressions can be referred to inside the substituted string (the third parameter, love in this example). The string that matched the first parenthesized expression is referred to as "\1", the second as "\2", and so on up to "\9". The entire string that matched is referred to as "\0".
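As a sketch of how such references behave, the following commands reuse the email address and the cheesecake string from the earlier examples. Inside double quotes the backslash has to be doubled so that regsub, rather than Tcl, sees it; the output variable names are hypothetical.

```tcl
set addr "usenet@uunet.uu.net"

# Swap the two parenthesized pieces: host first, then user.
regsub "(.*)@(.*)" $addr "\\2!\\1" bang
puts $bang

# Refer to the whole match in order to wrap it in angle brackets.
regsub "cheese" "I like cheesecake!" "<\\0>" marked
puts $marked
```

The first command rewrites the address into host!user form, and the second leaves the matched text in place while adding characters around it.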