5.6. Regular Expressions

The Firebird segment of Fantasia 2000 has only three characters: a sprite, the firebird of the title, and a somewhat fastidious elk. Although the elk is the least remarked of the three, he is actually one of the more remarkable figures. His body and facial expressions are hand-drawn by traditional Disney animation techniques. His antlers, however, are a three-dimensional computer model. The challenge was to match the movement of the antlers with the hand-drawn two-dimensional animation. Solving this problem brought not only the elk to Fantasia 2000; it brought me along as well.

My computer-generated imagery (CGI) supervisor, Chyuan, came up with the splendid idea of capturing the stream of camera positions mapping the movements of the hand-drawn elk. These movements were then transformed into a set of curves and fed into the three-dimensional animation package. The antler was manually positioned at the start of the scene to sit atop the elk's head. The curves match the antlers' movement with that of the elk as the scene unfolds.

Chyuan patiently explained the math to me; I did the actual programming, first in C++, then in Perl, a scripting language. The Perl implementation was considerably smaller than the equivalent implementation in C++, largely because of the regular-expression support built into Perl. Regular-expression support under .NET is found in the System.Text.RegularExpressions namespace. This is the topic of this section.

What is a regular expression? It is a pattern of characters and symbols representing a character sequence of arbitrary length. For example, let's say we need to find all lines of text that begin with the number 5. The number can be any length, but it must be followed by a dash and then the letter a, b, or c, followed by one or more letters or characters. It must end with the sequence 2001. To use a regular expression, we need symbols to do the following:

  • Indicate that we wish to begin the search at the beginning of the line. We do this with the caret (^). So, for example, ^5 means we want the line to begin with a literal value of 5.

  • Indicate that we wish to match on a particular flavor of character. So, for example, \d means that we wish to match on a single digit between 0 and 9. \D means that we wish to match on a single character that is not a digit. \s means that we wish to match on a single white-space character, and \S means that we wish to match on a single character that is not white space. \w matches on any alphanumeric character (a to z, A to Z, 0 to 9) or the underscore. \W matches any character that \w does not.

  • Indicate that we wish to match on any character, regardless of its type. For example, the period (.) matches any character other than a newline.

  • Indicate that we wish to match on multiple (or no) instances of a character type. The plus operator (+) means that we wish to match on one or more characters of the same type. \d+, for example, matches on 2, 22, 1217, and so on. The asterisk (*) means that we wish to allow for no matches as well. For example, the regular expression

    ^5\d+\D+2001
    

    matches on any line that begins with 5 followed by one or more additional digits, followed by one or more nondigit characters followed by the literal 2001. The regular expression

    ^5\d*\D*2001
    

    matches on every line that begins with 5 and ends with the literal 2001. Between the 5 and 2001, there may or may not be some digits followed by some nondigit characters.

  • Indicate that we wish to match on a fixed number of characters. For example, the regular expression

    \d{3}-\d{4}
    

    requires three digits followed by a hyphen, followed by four digits, such as 375-4128.

  • Indicate that we wish to match on one of a set of different characters. We do this by placing a set of alternative characters within parentheses, separated by the OR operator (|). For example, the expression (a|e|i|o|u) means that we wish to match on one of the five English vowels. Adding the plus operator, as in (a|e|i|o|u)+, means that we wish to match on one or more consecutive occurrences of the five English vowels. Following the expression with an asterisk instead means that we wish to allow for no matches as well.
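The symbols listed above can be tried out directly against a few strings. What follows is only a minimal sketch: the class name PatternTour, the helper Matches(), and the sample strings are my own assumptions, chosen for illustration. It relies on the static Regex.IsMatch() method, and each pattern is written as a verbatim string (@"...") so that the backslashes reach the regular expression unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternTour
{
    // Hypothetical helper: true if pattern matches anywhere in input.
    public static bool Matches( string pattern, string input )
    {
        return Regex.IsMatch( input, pattern );
    }

    public static void Main()
    {
        // ^ anchors at the start; \d+ and \D+ each need at least one character
        Console.WriteLine( Matches( @"^5\d+\D+2001", "512-abc2001" ) ); // True
        Console.WriteLine( Matches( @"^5\d+\D+2001", "5-abc2001" ) );   // False: no digit after the 5

        // * allows zero occurrences, so the same line now matches
        Console.WriteLine( Matches( @"^5\d*\D*2001", "5-abc2001" ) );   // True

        // {n} requires exactly n repetitions
        Console.WriteLine( Matches( @"\d{3}-\d{4}", "375-4128" ) );     // True

        // alternation within parentheses, repeated one or more times
        Console.WriteLine( Matches( @"(a|e|i|o|u)+", "queue" ) );       // True
        Console.WriteLine( Matches( @"(a|e|i|o|u)+", "rhythm" ) );      // False: none of the five vowels
    }
}
```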

Regular expressions take some getting used to. In the beginning they seem quite complicated because they offer such compact notation. To facilitate your exploration of regular expressions, I've written a small regular-expression test program. You enter a string, a regular expression, or both, and the program identifies which matches, if any, occur. For example (my console input is the text typed after each prompt):

Would you like to enter a string to match against? (Y/N/?) y
Please enter a string, or 'quit' to exit.
      ==> 5abc2001

Would you like to change regular expressions? (Y/N/?) y
Please enter regular expression:
      **> ^5\d*(a|d|e)\w+2001

original string:  5abc2001
attempt to match: ^5\d*(a|d|e)\w+2001

The characters 5abc2001 match beginning at position 0

Would you like to enter a string to match against? (Y/N/?) y
Please enter a string, or 'quit' to exit.
      ==> 527ar2001

Would you like to change regular expressions? (Y/N/?) n

original string:  527ar2001
attempt to match: ^5\d*(a|d|e)\w+2001
The characters 527ar2001 match beginning at position 0

Of course, there can be multiple matches as well—for example,

original string:  r24d2
attempt to match: \d+
The characters 24 match beginning at position 1
The characters 2 match beginning at position 4

Let's try our hand at programming regular expressions under .NET. First let me show you the code that does the matching; then I'll explain what's going on:

public static void doMatch()
{
    Console.WriteLine( "original string:  {0}", inputString );
    Console.WriteLine( "attempt to match: {0}", filter );
    Regex regex = new Regex( filter );
    Match match = regex.Match( inputString );

    if ( ! match.Success )
    {
       Console.WriteLine( "Sorry, no match of {0} in {1}",
                           filter, inputString );
       return;
    }

    for ( ; match.Success; match = match.NextMatch() )
    {
       Console.WriteLine(
         "The characters {0} match beginning at position {1}",
         match.ToString(), match.Index
       );
    }
}

The Regex class represents our regular expression. We pass its constructor the string representation of the expression, which is compiled into an immutable internal representation. We cannot change the regular expression associated with a Regex object. A two-parameter constructor takes a second argument of type RegexOptions, an enumeration of option flags that modify how the pattern is matched.
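In current versions of .NET, the options are passed to the two-parameter constructor as a RegexOptions enumeration value. Here is a minimal sketch of its effect using RegexOptions.IgnoreCase. The class name OptionsDemo, the helper CountMatches(), and the sample text are hypothetical, and Matches() is used here only as a convenient way to count every match at once.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionsDemo
{
    // Hypothetical helper: counts the matches of pattern in input
    // under the given options.
    public static int CountMatches( string pattern, string input,
                                    RegexOptions options )
    {
        Regex regex = new Regex( pattern, options );
        return regex.Matches( input ).Count;
    }

    public static void Main()
    {
        string input = "bez BEZ Bez";

        // by default, matching is case sensitive
        Console.WriteLine(
            CountMatches( "bez", input, RegexOptions.None ) );       // 1

        // IgnoreCase lets the same pattern match all three spellings
        Console.WriteLine(
            CountMatches( "bez", input, RegexOptions.IgnoreCase ) ); // 3
    }
}
```

Options can be combined with the bitwise OR operator, for example RegexOptions.IgnoreCase | RegexOptions.Multiline.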

Match() performs the matching algorithm of the regular expression against its string argument. It returns a Match class object that holds the results of the pattern matching. The Match object is also immutable.

To discover if the match succeeded, we query the Success property of the Match class. Each match is spoken of as a capture. The Index property returns the position in the original string where the first character of the captured substring was found. Length returns the length of the captured substring. The ToString() method returns the captured substring.

The Match object that Match() returns holds the results of the first capture. If the regular expression captures multiple substrings, we use NextMatch() to access the second and each subsequent capture. Before we manipulate the next object, we must test that it represents a success. A sentinel Match object for which Success evaluates to false marks the end of the captured substrings. A typical for loop might look like this:

for ( Match match = regex.Match( inputString );
          match.Success;
          match = match.NextMatch() )
{ ... }
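To make the roles of Index, Length, and the captured substring concrete, here is a small self-contained sketch built around the same for loop. The class name CaptureWalk and the method Describe() are my own invention for illustration; the Value property used here returns the same captured substring as ToString().

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureWalk
{
    // Hypothetical helper: returns "value@index+length" for each
    // successive match of pattern in input.
    public static List<string> Describe( string pattern, string input )
    {
        List<string> found = new List<string>();
        Regex regex = new Regex( pattern );

        // the loop ends at the sentinel Match whose Success is false
        for ( Match match = regex.Match( input );
              match.Success;
              match = match.NextMatch() )
        {
            found.Add( match.Value + "@" + match.Index + "+" + match.Length );
        }
        return found;
    }

    public static void Main()
    {
        foreach ( string entry in Describe( @"\d+", "r24d2" ) )
            Console.WriteLine( entry );
        // prints 24@1+2, then 2@4+1
    }
}
```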

Consider the following three lines:

 5040      bez( 99,   -3.194, 43.8,  85    )
 4930.7823 bez( 10.7, 19.59,  -20,  -20.48 )
-5123      bez( -3.5,  2.46,  89,     0.02 )

These are samples of lines that we need to match. First we have to come up with a regular expression that can match each of these lines.

We see that each line begins with a number. The number can be either positive or negative, and it can represent either an integer or a floating-point value. The number is followed by white space, then the literal substring bez. Four comma-separated numbers follow that, enclosed within parentheses. The numbers can be negative or positive. They can be either integers or floating-point values. Before you look at my solution, try your hand at coming up with a regular expression that captures each of these lines in its entirety.

Once we have our regular expression, we're still not done. Our next problem is how to gain access to the individual parts of the line. That is, our regular expression captures the entire string; we now need to pick it apart to access the five numeric fields.

The regular-expression syntax supports a grouping mechanism in which we assign numbers to particular subfields of the match. We can subsequently use these numbers to access the subfields. For example, the following identifies a group associated with the index 1 using the special ?<1> syntax:

(?<1>(-*\d+\.\d+)|(-*\d+))

Can you read this? It represents an alternative pair of regular expressions. The first one,

-*\d+\.\d+

matches a floating-point number that may or may not be negative. The second,

-*\d+

matches an integer value that also may or may not be negative. The entire regular expression, with five identified groups, looks like the following. For clarity, I've broken it up and identified each subfield. For realism, I've listed it as a string literal with the necessary double backslash (\\) escape:

string filter =
    // the number before the bez literal
    "(?<1>(-*\\d+\\.\\d+)|(-*\\d+))" +

    // arbitrary white space, bez literal, and open paren
    "\\s*bez\\(\\s*" +

    // the four internal numeric values, separated by a literal
    // comma and optional white space
    "(?<2>(-*\\d+\\.\\d+)|(-*\\d+)),\\s*" +
    "(?<3>(-*\\d+\\.\\d+)|(-*\\d+)),\\s*" +
    "(?<4>(-*\\d+\\.\\d+)|(-*\\d+)),\\s*" +
    "(?<5>(-*\\d+\\.\\d+)|(-*\\d+))";

Now we attempt the match on the line of text:

Regex regex = new Regex( filter );
Match match = regex.Match( line );

If the match is successful, we need to grab each of the five numeric subfields and translate them into values of type float:

float loc        = Convert.ToSingle( match.Groups[1].ToString() );
float m_xoffset1 = Convert.ToSingle( match.Groups[2].ToString() );
float m_yoffset1 = Convert.ToSingle( match.Groups[3].ToString() );
float m_xoffset2 = Convert.ToSingle( match.Groups[4].ToString() );
float m_yoffset2 = Convert.ToSingle( match.Groups[5].ToString() );

The Group class represents a capturing group within the returned Match class object. We access each Group object through its associated index. The ToString() method returns the captured substring. In this case we invoke the Convert.ToSingle() method on each string to convert the value into type float.
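The pieces of this exercise can be pulled together into one small program. The following is a sketch of one possible variation, not the code used on the film. Several things in it are my own assumptions: the class name BezParser; the group names loc, x1, y1, x2, and y2, which use the same (?<...>) syntax with a name in place of a number; the (?: ... ) form, which groups the two alternatives without creating an extra capture; the \s* added around the parentheses and commas to tolerate the spacing in the sample lines; and the use of CultureInfo.InvariantCulture so that the period is always read as the decimal point.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class BezParser
{
    // a float or an integer, either of which may be negative
    const string Number = @"(?:-*\d+\.\d+|-*\d+)";

    static readonly Regex Line = new Regex(
        @"(?<loc>" + Number + @")\s*bez\(\s*" +
        @"(?<x1>"  + Number + @")\s*,\s*" +
        @"(?<y1>"  + Number + @")\s*,\s*" +
        @"(?<x2>"  + Number + @")\s*,\s*" +
        @"(?<y2>"  + Number + @")\s*\)" );

    // Returns the five fields in order, or null if the line does not match.
    public static float[] Parse( string line )
    {
        Match match = Line.Match( line );
        if ( ! match.Success )
            return null;

        string[] names = { "loc", "x1", "y1", "x2", "y2" };
        float[] values = new float[ names.Length ];
        for ( int i = 0; i < names.Length; i++ )
            values[ i ] = Convert.ToSingle( match.Groups[ names[ i ] ].Value,
                                            CultureInfo.InvariantCulture );
        return values;
    }

    public static void Main()
    {
        float[] fields = Parse( "-5123      bez( -3.5,  2.46,  89,     0.02 )" );
        foreach ( float f in fields )
            Console.WriteLine( f.ToString( CultureInfo.InvariantCulture ) );
    }
}
```

Named groups are accessed through the same Groups indexer, with a string in place of the integer index.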

A useful Regex class method is Split(). This method is similar to the String class Split() method: both return a string array. Unlike the String class Split() method, however, the Regex class version separates the input string on the basis of a regular expression rather than a set of characters. For example:

string textLine =
    "Danny%Lippman%%Point Guard%Shooting Guard%%floater";

string splitMe = "%+";
Regex  regex = new Regex( splitMe );

foreach ( string capture in regex.Split( textLine ))
    Console.WriteLine( "capture: {0}", capture );

In this example we are splitting textLine at each point where one or more % characters appear. When executed, this code generates the following output:

capture: Danny
capture: Lippman
capture: Point Guard
capture: Shooting Guard
capture: floater
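The difference between the two Split() methods shows up at the doubled % characters. The following sketch splits the same line both ways; the class name SplitCompare and the helper names ByRegex() and ByChar() are hypothetical, introduced only for this comparison.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitCompare
{
    // splits wherever one or more % characters appear
    public static string[] ByRegex( string text )
    {
        return new Regex( "%+" ).Split( text );
    }

    // splits at every individual % character
    public static string[] ByChar( string text )
    {
        return text.Split( '%' );
    }

    public static void Main()
    {
        string textLine =
            "Danny%Lippman%%Point Guard%Shooting Guard%%floater";

        Console.WriteLine( ByRegex( textLine ).Length ); // 5
        Console.WriteLine( ByChar( textLine ).Length );  // 7: each %% pair yields an empty string
    }
}
```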

Another useful Regex class method is Replace(), which allows us to replace captured substrings with alternative text. Here is a simple example:

public static void testReplace()
{
   string re = "XP.\\d+";
   Regex regex = new Regex( re );

   string textLine =
      "XP.109 is currently in alpha. " +
      "XP.109 represents a staggering leap forward";

   string replaceWith = "ToonPal";

   Console.WriteLine ( "original text: {0}", textLine );
   Console.WriteLine ( "regular expression : {0}", re );

   string replacedText = regex.Replace( textLine, replaceWith );
   Console.WriteLine ( "replacement text: {0}", replacedText );
}

When compiled and executed, this code generates the following output (reformatted slightly for better display):

original text: XP.109 is currently in alpha. XP.109 represents a
          staggering leap forward

regular expression : XP.\d+

replacement text: ToonPal is currently in alpha. ToonPal
          represents a staggering leap forward
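Replace() can also reuse part of what was captured. In the replacement string, $1 stands for the text captured by group 1. The sketch below is my own variation on the example above, not part of the original program: the class name ReplaceKeepNumber and the method Rename() are hypothetical, and the pattern escapes the period (\.) so that it matches only a literal period.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceKeepNumber
{
    // Renames each XP.<digits> token while keeping its digits.
    public static string Rename( string text )
    {
        // the parentheses capture the digits as group 1
        Regex regex = new Regex( @"XP\.(\d+)" );

        // $1 is replaced by whatever group 1 captured
        return regex.Replace( text, "ToonPal $1" );
    }

    public static void Main()
    {
        Console.WriteLine( Rename( "XP.109 is currently in alpha." ) );
        // prints: ToonPal 109 is currently in alpha.
    }
}
```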
