Chapter 5. Regular Expressions

The previous chapter described glob patterns. You were probably familiar with them from the shell. Glob patterns are very simple and are sufficient for many purposes. Hence they are the default style of pattern matching in expect commands. However, their simplicity brings with it limitations.

For example, glob patterns cannot match any character not in a list of characters, nor can glob patterns match a choice of several different strings. Both of these turn out to be fairly common tasks. And while both can be simulated with a sequence of several other commands, Expect provides a much more powerful and concise mechanism: regular expressions.

Regular Expressions—A Quick Start

To jumpstart your knowledge of regular expressions (regexp for short), I will start out by noting their similarities to glob patterns. As the following table of examples shows, every glob pattern is representable by a regular expression. In addition, some regular expressions cannot be represented as glob patterns.

Table 5-1. Comparison of glob patterns and regular expressions

glob      regexp     English
s         s          literal s
\*        \*         literal *
^         ^          beginning of string
$         $          end of string
[a-z]     [a-z]      any character in the range a to z
          [^a-z]     any character not in the range a to z
?         .          any single character
*         .*         any number of characters

For example, both the glob pattern foo and the regular expression foo match the literal string "foo". Backslash works in the usual way, turning the following character into its literal equivalent. "^" and "$" also work the same way as before. Regular expression ranges work as before, plus they can also be used to match any character not in the range by placing a "^" immediately after the left bracket. (I will show more detail on this later.) Besides this, the only significant differences in the table are the last two lines, which describe how to match any single character and any number of any characters.

Except for ".*", each of the patterns in the table is called an atom. A * appended to an atom creates a pattern that matches any number (including zero) of the particular atom.

For example, the regular expression "a*" matches any string of a’s, such as "a", "aa", "aaaaaaaaa" and "". That last string has no a’s in it at all. This is considered a match of zero a’s.

The pattern [0-9]* matches strings made up of digits such as "012" and "888". Notice that the atom does not have to match the same literal value each time. When matching "012", the range "[0-9]" first matches "0", then it matches "1", and finally matches "2".

You can use ranges to construct more useful patterns. For example [1-9][0-9]* matches any positive integer. The first atom matches the first digit of the number, while the remaining digits are matched by the "[0-9]*".

C language identifiers can be matched with the pattern "[a-zA-Z_][a-zA-Z0-9_]*". This is similar to the previous pattern. In both cases the first character is restricted to a subset of the characters that can be used in the remaining part of the string.

In both cases the * only applies to the immediately preceding range. That is because the * only applies to the immediately preceding atom. One range is an atom; two ranges are not an atom.

Atoms by themselves and atoms with a * appended to them are called pieces. Pieces can also consist of atoms with a + or ? appended. An atom followed by + matches a sequence of one or more matches of the atom. An atom followed by ? matches the atom or the empty string. For example, "a+" matches "a" and "aa" but not "", while "a?" matches "a" and "" but not "aa". The pattern "0x[0-9a-f]+" matches a hexadecimal number in the C language such as "0x0b2e" or "0xffff". The pattern "-?[1-9][0-9]*" matches positive or negative integers such as 1, 10, 1000, -1, and -1000. Notice how the [1-9] range prevents a zero from being the first digit, preventing strings like -05 and 007.

"-?[1-9][0-9]*" is a sequence of three pieces. Any sequence of pieces is called a branch. Branches separated by a | match any of the branches. For example, you could extend the previous pattern to match any integer with the pattern "-?[1-9][0-9]*|0". The first branch matches any nonzero integer while the second branch matches zero itself.

Tcl integers can be written in decimal, hex, or octal. The following pattern uses three branches to match such integers: "-?[1-9][0-9]*|0x[0-9a-f]+|0[0-7]*". The first branch ("-?[1-9][0-9]*") matches any positive or negative decimal constant. The second branch matches any hex constant. The third branch matches any octal constant. A separate branch for zero is not needed, since it is matched by the octal branch already. Fortunately, zero in octal is equal to zero in decimal, so there is no problem interpreting it in a different way!
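The three-branch pattern can be tried outside of Expect. The following is only a sketch, assuming Tcl's built-in regexp command as a stand-in for "expect -re". The procedure name is_tcl_integer and the surrounding ^(...)$ anchors are illustrative additions, not part of the pattern in the text; the anchors force the whole string to match. The pattern is written in braces so that Tcl leaves the brackets alone.

```tcl
# Anchored copy of the three-branch integer pattern from the text.
# Braces keep Tcl from treating [ ] as command substitution.
set int_re {^(-?[1-9][0-9]*|0x[0-9a-f]+|0[0-7]*)$}

# Hypothetical helper: returns 1 if s looks like a decimal, hex, or
# octal integer according to the pattern above, 0 otherwise.
proc is_tcl_integer {s} {
    global int_re
    # "--" ends the switches, so a string starting with "-" is safe
    return [regexp -- $int_re $s]
}

foreach s {42 -17 0x1f 017 0 09 -05 abc} {
    puts [format "%-5s %d" $s [is_tcl_integer $s]]
}
```

Of the sample strings, only 09, -05, and abc are rejected: 09 contains a digit that is not octal, and -05 has a leading zero after the sign.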

Identifying Regular Expressions And Glob Patterns

In order to actually use a regular expression, you must do two things. First, you must backslash any characters that are special to both Tcl and regular expressions. For example, the regular expression to match a single digit is "[0-9]". To prevent Tcl from trying to evaluate 0-9 as a command, the leading bracket must be prefixed with a backslash so it looks like this:

\[0-9]

See Chapter 4 (p. 91) for more information on using backslashes.
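As a small illustration of the quoting rule, the sketch below builds the same pattern twice: once in double quotes with the bracket backslashed, and once in braces, which suppress Tcl substitution entirely. It assumes Tcl's built-in regexp and "string equal" commands in place of a live expect command; the variable names are made up for the example.

```tcl
# The same regular expression written two ways.
set quoted "\[0-9]"   ;# backslash stops Tcl from running 0-9 as a command
set braced {[0-9]}    ;# braces suppress all substitution

puts [string equal $quoted $braced]   ;# are the two strings identical?
puts [regexp -- $quoted "7"]          ;# try the pattern against a digit
puts [regexp -- $quoted "x"]          ;# try the pattern against a letter
```

Either form hands the matcher the five characters [0-9]. Which one to use is mostly a matter of whether the pattern also needs variable substitution.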

The second thing you must do is to tell expect that a pattern is a regular expression. By default, expect assumes patterns are glob patterns. Another line can be added to Table 5-1.

glob      regexp     English
-gl       -re        pattern type prefix

Patterns prefixed with -re are regular expressions. For example, the following command matches "a", "aa", and "aaaaa". It does not match "ab".

expect -re "a*"                    ;# regexp pattern

Without the -re, the command matches "aa", "ab", and "ac" (among other things).

expect "a*"                    ;# glob pattern

It is possible to have a mixture of glob patterns and regular expressions. In the following example, "a*" is a regular expression but "b*" is a glob pattern.

expect {
    -re "a*"    {action1}
    "b*" {action2}
}

The expect command also accepts the -gl flag. The -gl flag tells expect that the pattern is a glob pattern. This is useful if the pattern looks like one of the keywords such as timeout or a flag such as -re.[21]

expect {
    eof {found_real_eof}
    -gl "timeout" {found_literal_timeout}
    -gl "-re" {found_real_dash_r_e}
}

You might also want to pass the pattern as a variable. In this case, the -gl flag also protects the pattern from matching a keyword or flag.

expect -gl $pattern

If you completely declare your pattern types, you can embed them inside of subroutines and pass patterns as arguments without worrying about them being misinterpreted. This is especially useful if you might reuse the subroutines in the future or allow users to pass arbitrary patterns in to a script. Users of your scripts should not have to care about the keywords inside of Expect.

Using Parentheses To Override Precedence

Once you understand how to build regular expressions, you need not worry about remembering the terms “atom”, “piece”, and “branch”. The terms exist only to help you learn the precedence of the regular-expression operators. To avoid confusion, from now on I will generically refer to any subpattern of a complete pattern when it is unimportant whether it is an atom, piece, or branch.

Because operators such as * and + act only on atoms, they cannot be applied directly to pieces and branches. For example, the pattern ab* matches an a followed by any number of b’s. In order to treat any subpattern (atom, piece, or branch) as an atom, enclose it in parentheses. Thus, in order to match any number of ab’s, use the pattern "(ab)*".

Matching real numbers is a good exercise. Real numbers have a whole portion to the left of the decimal point and a fractional portion to the right. A direct rendering of this concept is "-?[0-9]*\.?[0-9]*". Notice the period is escaped by placing a backslash in front of it. This forces it to match a literal period rather than any character. The entire pattern matches things like "17.78", "-8", and "0.21". Unfortunately, it also accepts 0000.5, which does not seem quite right. You can reject leading zeros while still accepting a single zero the same way I did earlier, with a branch: "-?(0|[1-9][0-9]*)?\.?[0-9]*". This pattern accepts the earlier numbers but it rejects "0000.5". Unfortunately, it still matches "-0". You can fix this as an exercise but it is not worth worrying about that much. In Chapter 6 (p. 138), I will demonstrate how to handle this problem much more easily.

To use this regular expression in a command, any characters special to Tcl must also be escaped as I described in the previous section. Here is what the complete command might look like:

expect -re "-?(0|\[1-9]\[0-9]*)?\\.?\[0-9]*"
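The two real-number patterns can be compared side by side. This is a sketch under stated assumptions: it uses Tcl's built-in regexp command rather than expect, adds ^ and $ anchors so that each test string must match in full, and uses the made-up variable names loose, strict, and quoted. The last assignment shows how the stricter pattern looks once it is moved from braces into double quotes.

```tcl
# Both patterns from the text, anchored so the whole string must match.
set loose  {^-?[0-9]*\.?[0-9]*$}
set strict {^-?(0|[1-9][0-9]*)?\.?[0-9]*$}

foreach s {17.78 -8 0.21 0000.5 -0} {
    puts [format "%-7s loose=%d strict=%d" $s \
        [regexp -- $loose $s] [regexp -- $strict $s]]
}

# The strict pattern again, this time in double quotes: brackets, the
# backslash before the period, and the final $ all need a backslash.
set quoted "^-?(0|\[1-9]\[0-9]*)?\\.?\[0-9]*\$"
puts [string equal $quoted $strict]
```

Only 0000.5 separates the two patterns: the loose one accepts it and the strict one does not. Both still accept -0, as the text warns, and because every piece is optional both also accept the empty string.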

In practice, most patterns do not get very complex. It is almost always possible to get by using simple patterns. For example, you could use patterns that accept bad data (such as malformed numbers) if you know that the program never generates them anyway. Deciding how much effort to invest in writing patterns takes a little experience. But it will come with time.

Using Parentheses For Feedback

In the previous section, parentheses were used to group subpatterns together. Parentheses also play another role, one that leads to much shorter scripts than would otherwise be possible. When a regular expression successfully matches a string in the input buffer, each part of the string that matches a parenthesized subpattern is saved in the array expect_out. The string that matches the first parenthesized subpattern is stored in "expect_out(1,string)". The string that matches the second is stored in "expect_out(2,string)". And so on, up to "expect_out(9,string)".

For example, suppose you want to know the characters that occur between two other characters. If the input buffer contains "junk abcbcd" and I use the pattern "a(.*)c", this matches the input buffer and so expect makes the following assignment:

set expect_out(1,string) "bcb"

expect also stores the string that matches the entire pattern in "expect_out(0,string)":

set expect_out(0,string) "abcbc"

Finally, expect stores the whole string that matches the entire pattern as well as everything that came before it in expect_out(buffer):

set expect_out(buffer) "junk abcbc"

The "d" in the input buffer was never matched, so it remains there.

The last two assignments (to expect_out(buffer) and expect_out(0,string)) occur with glob patterns as well.

The values of expect_out are never deleted but can be overwritten by new expect commands. Assuming that the input buffer holds "junk abcbcd", both of the following commands match:

expect -re "a(.*)c"
expect -re "d"

Earlier I showed the assignments that occur during the first command. When the second command executes, two other assignments take place:

set expect_out(buffer) "d"
set expect_out(0,string) "d"

expect_out(1,string) remains equal to "bcb" from the earlier expect command.
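The bookkeeping for the first command can be imitated in plain Tcl. The sketch below is an analogy only: it assumes the built-in regexp command in place of expect, and the variables out_0_string, out_1_string, out_buffer, and leftover are made-up stand-ins for the expect_out elements and the unread input.

```tcl
set input "junk abcbcd"

# Match once for the strings and once for the character positions.
regexp {a(.*)c} $input whole sub1
regexp -indices {a(.*)c} $input whole_idx
set last [lindex $whole_idx 1]      ;# index of the last matched character

# Plain variables standing in for the expect_out elements.
set out_0_string $whole
set out_1_string $sub1
set out_buffer   [string range $input 0 $last]
set leftover     [string range $input [expr {$last + 1}] end]

puts "0,string = <$out_0_string>"
puts "1,string = <$out_1_string>"
puts "buffer   = <$out_buffer>"
puts "leftover = <$leftover>"
```

The buffer value is everything up to and including the end of the match, which is why "junk " appears in it but the trailing "d" does not.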

More On The timed-read Script

In Chapter 3 (p. 77), I defined an Expect script called timed-read. Called from the shell with the maximum number of seconds to wait, the script waits for a string to be entered, unless the given time is exceeded. Here it is again:

set timeout $argv
expect "\n" {
    send [string trimright "$expect_out(buffer)" "\n"]
}

In that earlier example I used "string trimright" to trim off the newline. It is possible for the expect command to do this in the pattern matching process, but not with glob patterns. I will use a regular expression to rewrite the script above without using the string command.

# timed read using a regular expression
set timeout $argv
expect -re "(.*)\n" {
    send $expect_out(1,string)
}

In the expect command, the -re flag declares the following pattern to be a regular expression instead of a glob pattern. As before, the pattern is quoted with double quotes. The pattern itself is:

(.*)\n

This matches any string of characters followed by a "\n". All the characters except for the \n get stored in expect_out(1,string). Since the pattern is successfully matched, the action is executed. The action is to send all the characters in $expect_out(1,string) to the standard output. Of course, these characters are exactly what was just typed without the terminating "\n".

Pattern Matching Strategy

At the beginning of the previous chapter was an example where the string "philosophic\n" was matched against the pattern "*hi*". This pattern can be rewritten as a regular expression:

expect -re ".*hi.*"

Adding parentheses around various pieces of the pattern makes it possible to see exactly how the string is matched. Here is a snapshot of Expect running interactively. First I entered the expect command to wait for a string. Then I entered philosophic and pressed return. Finally I printed the first four elements of expect_out surrounded by angle brackets.

expect1.1> expect -re "(.*)(hi)(.*)"
philosophic
expect1.2> for {set i 0} {$i<4} {incr i} {
+> send "<$expect_out($i,string)>\n"}
<philosophic
>
<philosop>
<hi>
<c
>
expect1.3>

You can see that the entire string matched was "philosophic\n". The first .* matched philosop while the second .* matched "c\n". "hi", of course, matched "hi" but notice that it was the second "hi" in the string, not the first "hi". This is similar to the way the analogous glob pattern worked in the previous chapter, but I will go over it again in the context of regular expressions.

Each piece of the pattern attempts to match as many characters as possible. Pattern pieces are matched before remaining subpatterns to the right in the pattern. So if there is any question about which pattern piece matches a string, it is resolved in such a way that the leftmost piece matches the longest run of characters. For example, the pattern .* always matches its input. By comparison, so does ".*.*". In that case, the first .* matches the entire input while the second .* matches the empty string.
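The leftmost-piece-wins rule can be checked without a spawned process. The following sketch assumes Tcl's built-in regexp command as a stand-in for "expect -re"; the variable names are invented for the example.

```tcl
# Each parenthesized piece is captured so the division of labor is visible.
set input "philosophic\n"
regexp {(.*)(hi)(.*)} $input all before hi rest
puts "before=<$before> hi=<$hi>"

# With two adjacent .* pieces, the leftmost one takes everything.
regexp {(.*)(.*)} "abc" all2 first second
puts "first=<$first> second=<$second>"
```

The first .* swallows "philosop", which leaves the second "hi" in the string for the middle piece. In the second match, first holds "abc" and second is empty.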

When there are branches, they are tried from left to right. The first branch used is that which allows the entire pattern to match. For example, if the input buffer contains "ac" and the pattern is "a(b|c|d)", the second branch is used to match the "c".

When branching, the goal is not to match as many characters as possible. Rather, the goal is simply to match at all. Consider the behavior of the pattern a*(b*|(ab)*) on the input "aabab". The a* matches the first two a’s and the b* branch matches the first "b". Since only one branch is necessary for the entire pattern to match, the entire pattern is considered to have matched successfully. Ignoring the first branch for the moment, it is possible to match every character in the input by only matching one a with the pattern "a*" and then using the second branch to match "abab". If this is the behavior you want, you will have to rewrite the pattern. One possibility is just to reverse the branches. Another is the pattern "a*(b|ab)*".

Branches are analogous to writing the patterns entirely separately. For example, the pattern a(b*)|b could be written in either of two ways:

expect -re "a(b*)|b" action

or

expect {
    -re "a(b*)" action
    -re "b"     action
}

In either case, the effective behavior is the same. When multiple patterns are listed, they behave exactly like a branched pattern. You may choose to combine them if the action is identical, but it is not necessary. And it may be more readable not to combine them. In this example, the b at the end of the branch is hard to see, while the second version makes it very clear.[22]

Nested Parentheses

The pattern a*((ab)*|b*) has nested parentheses. How are the results stored in the array expect_out? They are determined by counting left parentheses: the subexpression which starts with the Nth left parenthesis corresponds with expect_out(N,...). For example, the string that matches ((ab)*|b*) is stored in expect_out(1,string), and the string that matches (ab) is stored in expect_out(2,string). Of course, the only string that could possibly match "ab" is "ab". If you want to record the string of ab’s that corresponds to (ab)* then you have to put another pair of parentheses around that, as in "((ab)*)".
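Counting left parentheses can be tried in plain Tcl. This sketch assumes the built-in regexp command in place of expect, and it adds ^ and $ anchors (not in the pattern above) so that the sample input "aabab" can only be matched one way. The match variables m0 through m2 and n0 through n3 are illustrative names that play the role of the numbered expect_out elements.

```tcl
set input "aabab"

# Anchored variant of a*((ab)*|b*): two left parentheses, two submatches.
regexp {^a*((ab)*|b*)$} $input m0 m1 m2
puts "m1=<$m1> m2=<$m2>"

# An extra pair of parentheses around (ab)* records the whole run of ab's.
regexp {^a*(((ab)*)|b*)$} $input n0 n1 n2 n3
puts "n1=<$n1> n2=<$n2> n3=<$n3>"
```

In the first match, m1 is "abab" and m2 is just "ab". In the second, the added pair becomes the second left parenthesis, so n2 holds "abab" and the innermost (ab) moves to n3.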

Figure 5-1. Strings that match parenthesized subpatterns are stored in the expect_out array

The string matched by the whole pattern is stored in expect_out(0,string). This makes sense if you imagine the whole pattern is wrapped in another set of parentheses. And you can also imagine the whole pattern prefaced by a .* pattern[23] and wrapped in yet another pair of parentheses to determine the value of expect_out(buffer).

Figure 5-2. The original pattern is shaded. Text that matches the imaginary pattern (.*(...)) is stored in expect_out(0,string) and expect_out(buffer).

Always Count Parentheses Even Inside Of Alternatives

To decide which element of expect_out to use, count the number of prior parenthesized expressions. The simplest way to do this is just to count the number of left parentheses. This works even if the parentheses occur in an alternative.

Consider the pattern "(a)|(b)". If the string is "a", the variable expect_out(1,string) will be set to "a". If the string is "b", expect_out(2,string) will be set to "b". In this way, it is possible that expect_out(2,string) can be defined but not expect_out(1,string).
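Plain Tcl shows the same numbering, with one difference worth noting. In this sketch, which assumes the built-in regexp command rather than expect, a submatch that did not take part in the match is reported as an empty string, or as "-1 -1" when -indices is given, instead of being left undefined. The variable names are invented for the example.

```tcl
# The second alternative matches, so the first submatch is empty.
regexp {(a)|(b)} "b" all one two
puts "one=<$one> two=<$two>"

# -indices makes the distinction explicit: -1 -1 marks a subpattern
# that did not participate in the match.
regexp -indices {(a)|(b)} "b" all_i one_i two_i
puts "one_i=<$one_i> two_i=<$two_i>"
```

The string that matched (b) lands in the second variable even though (a) matched nothing, which is the position-by-counting behavior described above.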

This behavior may seem to be a disadvantage—the limit of nine parentheses can be used up even when appearing in non-matching alternatives.[24] But the advantage is that you can know ahead of time where in expect_out matching strings will be without worrying about whether other alternatives matched or not. For instance, the pattern a*((ab)*|b*)(c*) is similar to the pattern from the previous example but with (c*) appended. The variables expect_out(1,string) and expect_out(2,string) are set as before. The string matching (c*) is stored in expect_out(3,string) whether or not (ab) is matched.

In cases like this one, it may not be immediately evident whether an element of expect_out was written by the current expect since the elements can retain values from prior expect commands. If you need to be certain, set the element to an empty string (or some otherwise appropriate value) before the expect command.

set expect_out(1,string) "unassigned"
set expect_out(2,string) "unassigned"
expect -re "a*((ab)*|b*)(c*)"
if [string compare "unassigned" $expect_out(1,string)] {
    # expect_out(1,string) has been assigned, so use it
}

Example—The Return Value From A Remote Shell

While I have shown some fairly complex patterns, real patterns are usually pretty simple. In fact, sometimes other issues can make the patterns seem like the easy part of writing scripts.

The rsh program executes a command on a remote host. For example, the following command executes the command "quack twice" on host duck.

% rsh duck quack twice

While quack is a mythical program, you can imagine it shares a trait common to many programs. Namely, quack reports its status by returning an exit value. If quack works correctly, it exits with the value 0 (which by convention means success). Otherwise it exits with the value 1 (failure). From the C-shell, the status of the last command is stored in the variable status. I can demonstrate a successful interactive invocation by interacting directly with the C-shell.

% quack twice
% echo $status
0

Unfortunately, if quack is executed via rsh, the same echo command will not provide the exit status of quack. In fact, rsh does not provide any way of returning the status of a command. Checking the value of status after running rsh tells you only whether rsh itself ran successfully. The status of rsh is not really that useful. It reports problems such as "unknown host" if you give it a bogus host. But if rsh locates the host and executes the command, that is considered a success and 0 is returned no matter what happens inside the command. In fact, rsh is considered a success even if the command is not found on the remote host!

% rsh duck date
Wed Feb 17 21:04:17 EST 1993
% echo $status
0
% rsh duck daet
daet: Command not found.
% echo $status
0
% rsh duuck date
duuck: unknown host
% echo $status
1

There is no easy way to fix rsh without rewriting it. However, it is possible to write an Expect script that uses rlogin to do the job. Fortunately, rsh aims to provide an environment as close as possible to rlogin, so the hard part is done already. All that is left is to extract the right status. Here is an Expect script that does it—executing the command remotely and returning the remote status locally.

#!/usr/local/bin/expect --
eval spawn rlogin [lindex $argv 0]
expect "% "
send "[lrange $argv 1 end]\r"
expect "% "
send "echo \$status\r"
expect -re "\r\n(.*)\r\n"
exit $expect_out(1,string)

The second line spawns an rlogin process to the remote host. Next, the script waits for a prompt from the remote host. For simplicity here, the prompt is assumed to end with "%". The lrange extracts the commands and any arguments from the original invocation of the script. A return character is appended, and the command is sent to the remote host. After reading another prompt, the status is read by having the shell echo it.

Notice that in the second send command, the $ is preceded by a backslash. This prevents Tcl from doing variable substitution on the variable status locally. The string "$status" has to be sent literally. (When using the Bourne shell, the Expect script would have to ask for $? instead of $status.)

The regular expression "\r\n(.*)\r\n" is used to pick out the status. To see why this particular pattern is used, it helps to consider what the process sends back after the script has sent the string "echo $status\r". The following figure shows the whole dialogue.

Figure 5-3. Dialogue from an automated rsh script

Upon logging in, the first thing to appear is the message of the day followed by the prompt ("%"). Then the command "quack twice" is sent with a return appended. This is echoed by the remote shell, and you can see that the return is echoed as a \r\n sequence. Some time passes as the command runs. If it produces output, it would appear at this point. Finally another prompt appears after which the script requests that the status be echoed. The request itself is echoed followed by a "\r\n", the status itself, and yet another "\r\n". Another prompt appears, but the script has the information it wants in expect_out(1,string) already. So the script immediately terminates with the appropriate exit value.

It still may not be clear why the shell responds the way it does in this interaction. By default, a shell echoes everything that is sent to it. This is perfectly normal and is exactly what happens when you type directly to a shell. The shell arranges things so that when you press the x key, the letter x is echoed so that you can see it. Some translations also take place. For instance, when you press return, a return-linefeed sequence (\r\n) is echoed. Sure enough, that is just what can be seen from the remote shell. And it explains why the result of echo is seemingly surrounded by these characters. The first pair is the end of the echoed command, while the second is formatting from the echo command itself. (As I mentioned earlier, the echo command actually writes a newline (\n) and the terminal driver converts this to a return-linefeed sequence (\r\n).)

A few minor modifications can help the script. First, it is unlikely that you want any of the expect commands to time out, so the timeout should be disabled by saying "set timeout -1". Second, it is not necessary to send the two commands separately. Last, it is not necessary to explicitly write both characters in the return-newline sequence. You just need the characters that directly surround the status.

The rewritten script is:

#!/usr/local/bin/expect --
eval spawn rlogin [lindex $argv 0]
set timeout -1
expect -re "(%|#|\\$) "
send "[lrange $argv 1 end];echo \$status\r"
expect -re "\n(.*)\r"
exit $expect_out(1,string)

Although it is a good start, even this new script is not yet a complete replacement for rsh. This script only works for remote commands that produce no output because the final expect command matches anything after the first line of output. Try fixing the script so that it works for commands that produce any amount of output.
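To see the limitation concretely, the final pattern can be applied to two made-up transcripts. This sketch rests on several assumptions: the pattern is taken to be "\n(.*)\r" (the characters that directly surround the status), Tcl's built-in regexp command stands in for expect, and the strings quiet and noisy are invented examples of what the remote shell might send back. With a live process the match is attempted as output arrives, so the text actually captured can differ from what is shown here.

```tcl
set pattern "\n(.*)\r"

# Invented transcripts: a silent command and one that prints two lines.
set quiet "quack twice;echo \$status\r\n0\r\nduck% "
set noisy "quack twice;echo \$status\r\nQuack!\r\nQuack!\r\n0\r\nduck% "

regexp -- $pattern $quiet all status_quiet
regexp -- $pattern $noisy all status_noisy

# Show carriage returns and newlines visibly when printing the captures.
set visible [list "\r" "\\r" "\n" "\\n"]
puts "quiet: <[string map $visible $status_quiet]>"
puts "noisy: <[string map $visible $status_noisy]>"
```

For the silent command the captured text is just the status. For the noisy one the capture also includes the command's own output, which is not a usable exit value.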

Matching Customized Prompts

This new script (above) manages to find yet another excuse for a regular expression. The string it is capable of finding is either "%" or "$" or "#". This is a pretty good stab at recognizing common prompts since traditionally most C-shell users end prompts with "%", most Bourne shell users end prompts with "$", and root prompts in any shell typically end with "#". The $ has to be quoted because it is special to the pattern matcher. (More on this on page 123.)

Of course, users are free to customize prompts further, and some do, wildly so. There is no way to pick a pattern for the prompt that suffices for everyone. If you are just writing a script for yourself, you can always find a pattern that will match your prompt. But if you want a script that works for any user, this can be a challenge. Users are free to change their prompt at any time anyway and you cannot predict the future.

One reasonable solution is to have each user define a pattern that matches their own prompt. A good way to do this is to have users store their patterns in the environment variable EXPECT_PROMPT. As a reminder, this variable should be set immediately after the customization of the real prompt in their .cshrc or wherever they do set it. By keeping them together, when the user changes their prompt, they will naturally think to change their prompt pattern at the same time.

Here is Tcl code to read the environment variable EXPECT_PROMPT. The variable is retrieved from env, a predefined variable that holds all environment values. If the variable is not set, a default prompt pattern is used so that there is a good chance of still having a script function correctly.

if [info exists env(EXPECT_PROMPT)] {
    set prompt $env(EXPECT_PROMPT)
} else {
    set prompt "(%|#|\\$) "          ;# default prompt
}

Once prompt is set, you can use it as follows:

expect -re $prompt

An extract of a .cshrc file that has a prompt specialized to contain the current directory followed by ">" might look like this:

setenv PROMPT "$cwd!> "
setenv EXPECT_PROMPT "> "

There is another potential problem with trying to match shell prompts. Namely, the pattern may match something unexpected such as the message-of-the-day while logging in. In this case, the command will be sent before the prompt appears. When it does eventually appear, the prompt will be misinterpreted as a sign that the program has completed. This one-off error is a general problem that manifests itself when a pattern matches too early.

One way to defend against this problem is to end the pattern with "$".

set prompt "(%|#|\\$) $"

A bare $ matches the end of the input, analogously to the way ^ matches the beginning of the input. If more data has arrived that cannot be matched, expect continues waiting. This is very similar to the way people distinguish prompts. If you see your prompt (or something that looks even close to a prompt) in the middle of the message-of-the-day, you will not be fooled because you will see the computer continuing to print more text.
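The effect of the trailing $ can be seen on sample input. This is a sketch only: it assumes Tcl's built-in regexp command in place of expect, and the strings motd and at_prompt are invented examples of what might be sitting in the input buffer. The names loose and anchored are also made up for the comparison.

```tcl
set loose    "(%|#|\\$) "      ;# prompt pattern without the anchor
set anchored "(%|#|\\$) $"     ;# same pattern tied to the end of the input

# Invented input: a login message alone, then the same message plus a prompt.
set motd      "Disk usage is at 90% of quota\r\n"
set at_prompt "Disk usage is at 90% of quota\r\nduck1% "

puts "loose on motd:        [regexp -- $loose $motd]"
puts "anchored on motd:     [regexp -- $anchored $motd]"
puts "anchored on prompt:   [regexp -- $anchored $at_prompt]"
puts "anchored on Bourne:   [regexp -- $anchored {sh$ }]"
```

The unanchored pattern is fooled by "90% of" in the message, while the anchored one matches only when the percent sign and space are the last thing received.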

On the other hand, Expect is much faster than you are. The computer may appear to have stopped typing, perhaps even in the middle of the message-of-the-day, only because the CPU is being shared among many tasks. This unintentional pause can make Expect think that all the input has arrived.

There is no perfect solution. You can start another expect command to wait for a few more seconds to see if any more input arrives. But there is no guarantee even this or any time limit is good enough. Even humans can be faked out by random system indigestion. When was the last time you thought your program was hung, so you started pressing ^C only to find out that the network or some other resource was just temporarily overloaded?[25]

While it is possible to take extra steps, there simply is no way to guarantee that something that looks like a prompt really is a prompt, considering that even humans can be fooled. But for most purposes, a $ helps substantially in the face of unknown data that could unintentionally cause a premature match.

Fortunately, the specific problem of changing messages-of-the-day is moot. On most systems, the message-of-the-day no longer serves as a news distribution mechanism, having been replaced by superior interfaces such as Usenet news or other bulletin board-like systems. Instead, the message-of-the-day usually contains a message describing the revision level of the system, copyrights, or other information that rarely changes. In this case, users just need to choose patterns that avoid matching this characteristic information while still matching their prompts.

Example—A Smart Remote Login Script

In Chapter 3 (p. 83), I showed a script called aftp. This automated the initialization for an anonymous ftp session and then passed control from the script to the user. The same idea can be applied to many other programs.

Imagine the following scenario. You try to create a file only to find out that your computer has no permission to change the file system because it is mounted read-only from another computer, the file server.

% rm libc.a
rm: libc.a not removed: Read-only file system

All you have to do is log in to the server and then repeat the command:

% rlogin server

You are logged in to the server. Please be careful.

% rm libc.a
rm: libc.a: No such file or directory

Oops. You are not in the right directory. rlogin does not propagate your current working directory. Now you have to find out what directory you were in, and enter the appropriate cd command. If the directory is long enough or if you do this kind of thing frequently, the procedure can become a nuisance.

Here is a script to automatically rlogin in such a way that you are automatically placed in the same directory on the remote host.

set cwd [pwd]
spawn rlogin $argv
expect "% "
send "cd $cwd\r"
expect "% "
interact

The script starts by running pwd, which returns the current directory. The result is saved in the variable cwd. Then rlogin is spawned. When the prompt arrives, the script sends the cd command and then waits for another prompt. Finally, interact is executed and control is returned to the keyboard.

When you run the script, the output appears just as if you had actually typed the cd command yourself.

mouse1% rloginwd duck
spawn rlogin duck
You are logged in to duck.  Quack!
duck1% cd /usr/don/expect/book/chapter4/scripts
duck2%

The script is purposely simplified for readability. But it can be simplified even further. Doing so illustrates several important points. When pwd is executed, it runs on the same machine on which the script is running. Even after rlogin is spawned, pwd is run on the original system. You can move the pwd right into the send command, thereby obviating the need for the variable cwd altogether. pwd will still refer to the original directory. Indeed, even if you send yet another rlogin command to the remote host, the script continues to run on the original host. Remember that commands started via exec, spawn, catch or otherwise evaluated by Tcl are run on the original host in the original process. Commands that are sent via send operate in whatever new context has been established in the spawned process.

Just before the interact command is an expect command to wait for the prompt. This is actually unnecessary. What happens without the expect? The interact command gets control immediately. But what happens then? interact waits for either the user to type or the system to print something. Of course, the user will wait for the prompt, and when it arrives, interact will echo it so that the user can see it.

The difference then is that with the explicit expect, expect does the waiting, while with no expect, interact waits. In theory, the user could type only during interact, but in reality, the user will wait for the prompt in either case, so there is no functional difference. Hence, the expect can be omitted. Here is the final script, with these minor changes and all the other good stuff added back.

#!/usr/local/bin/expect --
set timeout -1
eval spawn rlogin $argv
if [info exists env(EXPECT_PROMPT)] {
    set prompt $env(EXPECT_PROMPT)
} else {
    set prompt "(%|#|\\\$) $"        ;# default prompt
}
expect -re $prompt
send "cd [pwd]\r"
interact

The technique shown in this script can be used for all sorts of things besides setting the current working directory. For example, you could copy all the environment variables. On systems running the X window system, it is useful to initialize the environment variable DISPLAY with the display name of your local host. Then commands executed on the remote system will be displayed on your local screen. If you use the same script on the second host to remotely login to yet another host, the original host definitions will continue to be used. If the remote system has no access to the local file system, it might also be useful to copy the X authority file.
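As a rough sketch of the DISPLAY idea, the fragment below builds the command that such a script could send just before interact. The helper name display_command is hypothetical, the csh-style setenv syntax is an assumption about the remote shell, and so is the ":0" display number.

```tcl
# Hypothetical helper: build a csh-style command that points DISPLAY
# at the given host. The ":0" display number is an assumption.
proc display_command {host} {
    return "setenv DISPLAY ${host}:0\r"
}

# [info hostname] is evaluated by Tcl, so it names the local host,
# just as [pwd] names the local directory.
set cmd [display_command [info hostname]]

# In the rlogin script, this line would go just before interact:
#   send $cmd
```

Because the host name is computed on the original host, the same reasoning that applies to pwd applies here.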

What I have shown here is just the tip of the iceberg. The interact command can do all sorts of other interesting things. These are described in Chapter 15 (p. 319) and Chapter 16 (p. 345).

One final note: This script assumes that rlogin does not prompt for a password. If it does, the script will fail. Explicitly waiting for passwords requires a little extra work. I will demonstrate how to do that in Chapter 8 (p. 195). In Chapter 15 (p. 335), I will demonstrate a simpler and more general approach to this kind of script that allows passwords and any other interactions to be handled automatically.

What Else Gets Stored In expect_out

In this chapter and the previous one I have shown how the array expect_out is used to store strings that match parenthesized subpatterns. The expect command can also set several other elements of expect_out.

If the pattern is preceded by the -indices flag, two other elements are stored for each element expect_out(X,string) where X is a digit. expect_out(X,start) contains the starting position of the first character of the string in expect_out(buffer). expect_out(X,end) contains the ending position.

Here is a fragment of one of the earlier rsh scripts, just as it queries the remote shell for the status, but with the -indices flag added:

send "echo \$status\r"
expect -indices -re "\r\n(.*)\r\n"

The -indices flag precedes the pattern (including its type). In later chapters, you will learn about other expect flags. All of these flags follow this pattern—they precede the pattern.

With the -indices flag, the expect command implicitly makes the following assignments:

set expect_out(0,start) "12"
set expect_out(0,end) "16"
set expect_out(0,string) "\r\n0\r\n"
set expect_out(1,start) "14"
set expect_out(1,end) "14"
set expect_out(1,string) "0"
set expect_out(buffer) "echo \$status\r\n0\r\n"
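The positions listed above can be checked without spawning anything. The following is a minimal sketch in plain Tcl, assuming the buffer holds exactly the echoed command followed by a status of 0. It uses Tcl's own regexp command with its -indices option rather than expect, so it illustrates only the arithmetic, not Expect's implementation.

```tcl
# Hard-coded stand-in for expect_out(buffer): the echoed command,
# then the status line, each terminated by \r\n.
set buffer "echo \$status\r\n0\r\n"

# With -indices, regexp stores {start end} pairs instead of strings.
regexp -indices "\r\n(.*)\r\n" $buffer whole sub

puts "whole match: $whole"
puts "subpattern:  $sub"

# Recover the status string from its start and end positions.
set status [string range $buffer [lindex $sub 0] [lindex $sub 1]]
puts "status: $status"
```

Here whole corresponds to expect_out(0,start) and expect_out(0,end), and sub to expect_out(1,start) and expect_out(1,end).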

These elements are set before an expect action begins executing. To be precise, the scope inside an expect action is exactly the same scope as that which contains the expect itself. In simple terms, as long as you are in a single procedure, a variable defined before the expect (or within it) can be referred to from inside an expect action. Similarly, a variable defined by expect or within an expect action can be accessed after the expect has completed. I will provide more detail about actions and scopes in Chapter 6 (p. 138).

Later in the book, I will describe yet additional elements of expect_out. However, these are the only ones that are important for now.

More On Anchoring

In the example on page 118, I defined a prompt as "(%|#|\\\$) $". An obvious question is: “Why do you need those backslashes?” (Or perhaps, “Okay, I know I need backslashes, but why three? Why not seventeen or some other random number?!”)

The reason is exactly the one I described in Chapter 4 (p. 91). In this case, the dollar sign is special both to the pattern matcher and to Tcl. This is similar to the problem with the "[".

To restate, the dollar sign must be prefaced with a backslash to get the pattern matcher to use it as a literal character. Without the backslash, the pattern matcher will use the dollar sign to match the end of the string. However, both the backslash and the dollar sign are also special to Tcl. When Tcl parses command arguments, it will try to make substitutions whenever it sees backslashes and dollar signs.

To avoid losing the backslash, the backslash itself must be prefaced with a backslash. Similarly, the dollar sign must be prefaced with a backslash. The result is "\\" and "\$", or when combined, "\\\$".

As always, there is more to this sorry story. The dollar sign substitution made by Tcl only occurs when an alphanumeric character follows. That is, "$foo" is replaced by the value of the variable foo. Since "$" all by itself would imply a variable name of no characters—obviously meaningless—Tcl does not perform any substitution on the dollar sign in such cases. For this reason, you can write the original pattern with two or three backslashes. Both have the same effect.

expect -re "(%|#|\\$) $"    ;# RIGHT
expect -re "(%|#|\\\$) $"   ;# RIGHT

However, in the case where you are matching the literal "$a", you need three.

expect -re "(%|#|\\$a) $"   ;# WRONG
expect -re "(%|#|\\\$a) $"  ;# RIGHT

This non-substitution behavior occurs anywhere "$" is not followed by an alphanumeric character such as in the string "$+". It is good programming practice to always use three backslashes when trying to match a dollar sign. For example, if you unthinkingly change the "+" to an "x" one day, the script will continue to work if you had used three backslashes originally but not if you had used two.

Nonetheless, even if you always use three backslashes, you should be aware that two work in this case. Matching a dollar sign prompt is so common, you are likely to see this a lot in other people’s scripts.
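To see the difference for yourself, you can ask Tcl what each spelling turns into before the pattern matcher ever sees it. This is a small sketch in plain Tcl; the variable a and its value are hypothetical and exist only to show what goes wrong with two backslashes.

```tcl
set a "oops"   ;# hypothetical variable, defined only to show the hazard

# Two and three backslashes before a dollar sign that Tcl would not
# substitute anyway (it is followed by a right parenthesis).
set two   "(%|#|\\$) $"
set three "(%|#|\\\$) $"
puts "same pattern: [string equal $two $three]"

# Two backslashes before \$a: Tcl substitutes the variable a.
set wrong "(%|#|\\$a) $"
# Three backslashes: the dollar sign reaches the pattern matcher intact.
set right "(%|#|\\\$a) $"
puts "wrong: $wrong"
puts "right: $right"

# Only the second pattern matches a prompt that ends in "\$a ".
puts "wrong matches: [regexp $wrong "\$a "]"
puts "right matches: [regexp $right "\$a "]"
```

If the variable a did not exist, the line defining wrong would fail with an error rather than silently producing a different pattern.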

Earlier in this section, I mentioned that the $ is special to the pattern matcher. To be more specific, unless preceded by a backslash, the $ is special no matter where in the string it appears. The same holds for "^". At this point, you might be asking yourself: Why does the pattern matcher insist on interpreting a $ as the end of a string even when it is in the middle of the pattern? The reason is that there are patterns where this is exactly the behavior you want. Consider the following:

expect -re "% $|foo"

This command waits for either a "%" prompt or the string foo to appear. Clearly, the $ is in the middle of the pattern and makes sense. More complex scenarios are possible. For example, the pattern above might be stored in a variable and substituted into the middle of another pattern. You would still want the $ to have the same effect.
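The effect of a $ in the middle of a pattern can be tried out with Tcl's regexp command. This is only a sketch: the sample strings are made up, and regexp stands in for the matching that expect -re would do against real output.

```tcl
set pattern "% $|foo"

# "% " at the very end of the string satisfies the first branch.
puts [regexp $pattern "mouse1% "]

# Here "% " is followed by more text and foo is absent: no match.
puts [regexp $pattern "% ls"]

# The second branch matches foo anywhere in the string.
puts [regexp $pattern "some foo here"]

# The pattern still anchors correctly after being substituted
# into the middle of a larger pattern.
set bigger "($pattern)|bar"
puts [regexp $bigger "% ls"]
puts [regexp $bigger "bar"]
```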

In contrast, glob patterns are much simpler. They do not support concepts such as alternation. So attempting to anchor the middle of a glob pattern is pointless. For this reason, the glob pattern matcher treats ^ as special only if it appears as the first character in a pattern, and $ only if it appears as the last character.

Exercises

  1. Modify the timed-read script on page 112 so that it can take an optional default answer which will be returned if the timeout is exceeded.

  2. Modify the rsh script on page 118 so that it returns the remote status of programs no matter how much output they produce.

  3. Modify the rlogin script on page 120 so that it copies the X authority file. Modify the script so that it also supports telnet.

  4. The dump program starts by printing an estimate such as:

         DUMP: estimated 1026958 blocks (501.44MB) on 0.28 tape(s).

     Write a script to print out how many tapes the next backup will take.

  5. dump’s knowledge of backup peripherals is often out of date. Assuming you are using a device much bigger than dump realizes, write a script so that dump never stops and asks for a new tape.

  6. Enhance the script you wrote for the previous question so that it does a better job of asking for tapes. For example, if your tapes are ten times as big as dump thinks, then the script should ask only once for every ten times that dump asks.



[21] Any non-keyword pattern beginning with a hyphen must be preceded with "-gl" (or some other pattern type). This permits the addition of future flags in Expect without breaking existing scripts.

[22] The regular expressions described here differ slightly from POSIX.2 regular expressions. For example, POSIX regular expressions take the branch that matches the longest sequence of characters. As of May 1994, no plans have been formalized to introduce POSIX-style regular expressions. There are enough minor differences that, if POSIX regular expressions were added, they would likely be added as a new pattern type, rather than as a replacement for the existing regular expressions.

[23] Unlike the usual meaning of .* which matches as many characters as possible, this imaginary .* matches as few characters as possible while still allowing the match to succeed. Chapter 3 (p. 73) describes this in terms of anchoring.

[24] In reality, it is fairly unusual to use more than four or five pairs of parentheses. I have never run up to the limit of nine.

[25] In Computer Lib, Ted Nelson described a system administrator who was plagued by continual computer crashes. The system administrator eventually decided to blame a miscreant with the initials RH after discovering that a program named RHBOMB was always running when the system crashed. Several months later, the same system administrator noticed a file called RHBOMB. Rather than immediately accuse RH of hacking again, the system administrator decided to first look at the file. He issued the command to print the file on the screen: PRINT RHBOMB. All of a sudden, his terminal printed "TSS HAS GONE DOWN" and stopped. No prompt. Nothing else appeared. The system administrator thought to himself: “Incredible—a program so virulent that just listing it crashed the system!”

His fear was unjustified. The file turned out to be the string "TSS HAS GONE DOWN" followed by thousands of null characters, effectively not printing anything but delaying the system from printing the next prompt for a long time.
