Chapter 8. Regular Expressions

The .NET Framework Class Library (FCL) includes the System.Text.RegularExpressions namespace, which is devoted to creating regular expressions, executing them against a string, and obtaining the results.

Regular expressions take the form of a pattern that can be matched to zero or more characters within a string. The simplest of these patterns, such as .* (match any run of characters other than a newline) and [A-Za-z] (match any single ASCII letter), are easy to learn, but more advanced patterns can be difficult to learn and even more difficult to implement correctly. Learning and understanding regular expressions can take considerable time and effort, but the work will pay off.

Regular expression patterns can take a simple form—such as a single word or character—or a much more complex pattern. The more complex patterns can recognize and match such things as the year portion of a date, all of the <SCRIPT> tags in an ASP page, or a phrase in a sentence that varies with each use. The .NET regular expression classes provide a very flexible and powerful way to do such things as recognize text, replace text within a string, and split up text into individual sections based on one or more complex delimiters.

Despite the complexity of regular expression patterns, the regular expression classes in the FCL are easy to use in your applications. Executing a regular expression consists of the following steps:

  1. Create an instance of the Regex class that contains the regular expression pattern along with any options for executing that pattern.

  2. Retrieve a reference to an instance of the Match class by calling the Match instance method if you want only the first match found, or to an instance of the MatchCollection class by calling the Matches instance method if you want more than just the first match found.

  3. If you’ve called the Matches method to retrieve a MatchCollection object, iterate over the MatchCollection using a foreach loop. Each iteration gives access to one of the Match objects that the regular expression produced.
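The three steps above can be sketched as follows. This is a minimal illustration rather than code from one of the recipes; the class name RegexStepsSketch, the input string, and the pattern are all assumptions chosen for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexStepsSketch
{
    public static void Main()
    {
        // Hypothetical input and pattern, chosen only for illustration.
        string source = "Invoices dated 2003-11-04 and 2004-01-15";
        string pattern = @"\d{4}";   // four consecutive digits, such as a year

        // Step 1: create the Regex with the pattern and any options.
        Regex re = new Regex(pattern, RegexOptions.None);

        // Step 2 (first match only): Match returns a single Match object.
        Match first = re.Match(source);
        if (first.Success)
        {
            Console.WriteLine("First match: " + first.Value);
        }

        // Step 2 (all matches) and step 3: Matches returns a MatchCollection,
        // and each foreach iteration yields one Match from it.
        MatchCollection all = re.Matches(source);
        foreach (Match m in all)
        {
            Console.WriteLine("Found " + m.Value + " at index " + m.Index);
        }
    }
}
```

Checking the Success property before using the result of Match is worthwhile, because Match returns an object even when nothing in the string matched.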

8.1. Enumerating Matches

Problem

You need to find one or more substrings corresponding to a particular pattern within a string. You need to be able to inform the searching code to return either all matching substrings or only the matching substrings that are unique within the set of all matched strings.

Solution

Call the FindSubstrings method, which executes a regular expression and obtains all matching text. This method returns either all matching results or only the unique matches; this behavior is controlled by the findAllUnique parameter. Note that if the findAllUnique parameter is set to true, the unique matches are returned sorted alphabetically. Its source code is as follows:

using System;
using System.Collections;
using System.Text.RegularExpressions;

public static Match[] FindSubstrings(string source, string matchPattern,
                                     bool findAllUnique)
{
    SortedList uniqueMatches = new SortedList( );
    Match[] retArray = null;

    Regex RE = new Regex(matchPattern, RegexOptions.Multiline);
    MatchCollection theMatches = RE.Matches(source);

    if (findAllUnique)
    {
        for (int counter = 0; counter < theMatches.Count; counter++)
        {
            if (!uniqueMatches.ContainsKey(theMatches[counter].Value))
            {
                uniqueMatches.Add(theMatches[counter].Value, 
                                  theMatches[counter]);
            }
        }

        retArray = new Match[uniqueMatches.Count];
        uniqueMatches.Values.CopyTo(retArray, 0);
    }
    else
    {
        retArray = new Match[theMatches.Count];
        theMatches.CopyTo(retArray, 0);
    }

    return (retArray);
}

The following method searches for any tags in an XML string; it does this by searching for a block of text that begins with the < character and ends with the > character.

This method first displays all unique tag matches present in the XML string and then displays all tag matches within the string:

public static void TestFindSubstrings( )
{
    string matchPattern = "<.*>";

    string source = @"<?xml version='1.0' encoding='UTF-8'?>
             <!-- my comment -->
             <![CDATA[<escaped> <><chars>>>>>]]>
             <Window ID='Main'>
               <Control ID='TextBox'>
                 <Property Top='0' Left='0' Text='BLANK'/>
               </Control>
               <Control ID='Label'>
                 <Property Top='0' Left='0' Caption='Enter Name Here'/>
               </Control>
               <Control ID='Label'>
                 <Property Top='0' Left='0' Caption='Enter Name Here'/>
               </Control>
             </Window>";

    Console.WriteLine("UNIQUE MATCHES");
    Match[] x1 = FindSubstrings(source, matchPattern, true);
    foreach(Match m in x1)
    {
        Console.WriteLine(m.Value);
    }

    Console.WriteLine( );
    Console.WriteLine("ALL MATCHES");
    Match[] x2 = FindSubstrings(source, matchPattern, false);
    foreach(Match m in x2)
    {
        Console.WriteLine(m.Value);
    }
}

The following text will be displayed:

UNIQUE MATCHES
<!-- my comment -->
<![CDATA[<escaped> <><chars>>>>>]]>
</Control>
</Window>
<?xml version='1.0' encoding='UTF-8'?>
<Control ID='Label'>
<Control ID='TextBox'>
<Property Top='0' Left='0' Caption='Enter Name Here'/>
<Property Top='0' Left='0' Text='BLANK'/>
<Window ID='Main'>

ALL MATCHES
<?xml version='1.0' encoding='UTF-8'?>
<!-- my comment -->
<![CDATA[<escaped> <><chars>>>>>]]>
<Window ID='Main'>
<Control ID='TextBox'>
<Property Top='0' Left='0' Text='BLANK'/>
</Control>
<Control ID='Label'>
<Property Top='0' Left='0' Caption='Enter Name Here'/>
</Control>
<Control ID='Label'>
<Property Top='0' Left='0' Caption='Enter Name Here'/>
</Control>
</Window>

Discussion

As you can see, the regular expression classes in the FCL are quite easy to use. The first step is to create an instance of the Regex class that contains the regular expression pattern along with any options for running this pattern. The second step is to call one of two instance methods on that Regex object: Match, which returns a single Match object for the first match found, or Matches, which returns a MatchCollection containing a Match object for every match found.

The FindSubstrings method returns an array of Match objects that can be used by the calling code. You might have noticed that the unique elements are returned sorted, and the nonunique elements are not sorted. A SortedList, which is used by the FindSubstrings method to store unique strings that match the regular expression pattern, keeps its items sorted by key as they are added; here the key is the matched text, so copying the Values collection yields the Match objects in that order.
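The sorting behavior of SortedList can be seen in isolation with a small sketch. The class name SortedListSketch and the keys and values below are hypothetical and serve only to show that enumeration order follows the keys, not the order of insertion.

```csharp
using System;
using System.Collections;

public static class SortedListSketch
{
    public static void Main()
    {
        SortedList list = new SortedList();

        // Keys are added out of alphabetical order on purpose.
        list.Add("pear", 1);
        list.Add("apple", 2);
        list.Add("mango", 3);

        // Add throws if a key is already present, which is why
        // FindSubstrings calls ContainsKey before adding a match.
        if (!list.ContainsKey("apple"))
        {
            list.Add("apple", 4);
        }

        // Enumeration yields DictionaryEntry items in key order.
        foreach (DictionaryEntry entry in list)
        {
            Console.WriteLine(entry.Key + " = " + entry.Value);
        }
    }
}
```

The entries are written in the order apple, mango, pear, and the Values collection that FindSubstrings copies into its array is ordered the same way.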

The regular expression used in the TestFindSubstrings method is very simplistic and will work in most—but not all—conditions. For example, if two tags are on the same line, as shown here:

<tagData></tagData>

the regular expression will catch the entire line, not each tag separately, because .* is greedy and runs to the last > on the line. You could change the regular expression from <.*> to <[^>]*> to match only up to the first closing > ([^>]* matches everything that is not a >). However, this will fail in the CDATA section, matching <![CDATA[<escaped>, <>, and <chars> instead of <![CDATA[<escaped> <><chars>>>>>]]>. The more complicated @"(<!\[CDATA.*>|<[^>]*>)" will match either <!\[CDATA.*> (a greedy match for everything within the CDATA section) or <[^>]*>, described previously. The [ after <! must be escaped with a backslash; left unescaped, it would open a character class instead of matching a literal bracket.
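The differences between these patterns are easier to see side by side. The following sketch runs each one against the same-line tags and against the CDATA line from the test string; the class name TagPatternSketch and the Show helper are hypothetical, introduced only for this comparison. The [ in the last pattern is escaped so that it matches a literal bracket.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagPatternSketch
{
    // Hypothetical helper: prints every match of pattern found in source.
    private static void Show(string pattern, string source)
    {
        Console.WriteLine("Pattern: " + pattern);
        foreach (Match m in Regex.Matches(source, pattern))
        {
            Console.WriteLine("  " + m.Value);
        }
    }

    public static void Main()
    {
        string sameLine = "<tagData></tagData>";
        string cdata = "<![CDATA[<escaped> <><chars>>>>>]]>";

        // Greedy: one match covering both tags.
        Show("<.*>", sameLine);

        // Stops at the first >: each tag is matched separately.
        Show("<[^>]*>", sameLine);

        // The same pattern splits the CDATA section into fragments.
        Show("<[^>]*>", cdata);

        // The CDATA alternative is tried first, so the section stays whole.
        Show(@"(<!\[CDATA.*>|<[^>]*>)", cdata);
    }
}
```

The order of the alternatives matters in the last pattern: the regular expression engine tries them left to right at each position, so placing the CDATA branch first keeps <[^>]*> from claiming the start of the section.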

See Also

See the “.NET Framework Regular Expressions” and “SortedList Class” topics in the MSDN documentation.
