Chapter 7. Regular Expressions

7.0 Introduction

The .NET Framework Class Library (FCL) includes the System.Text.RegularExpressions namespace, which is devoted to creating, executing, and obtaining results from regular expressions executed against a string.

Regular expressions take the form of a pattern that matches zero or more characters within a string. The simplest of these patterns, such as .* (which matches anything except newline characters) and [A-Za-z] (which matches any letter) are easy to learn, but more advanced patterns can be difficult to learn and even more difficult to implement correctly. Learning and understanding regular expressions can take considerable time and effort, but the work will pay off.

Note

Two books that will help you learn and expand your understanding of regular expressions are Michael Fitzgerald’s Introducing Regular Expressions and Jan Goyvaerts and Steven Levithan’s Regular Expressions Cookbook, both from O’Reilly.

Regular expression patterns can take a simple form—such as a single word or character—or a much more complex pattern. The more complex patterns can recognize and match such items as the year portion of a date, all of the <SCRIPT> tags in an ASP page, or a phrase in a sentence that varies with each use. The .NET regular expression classes provide a very flexible and powerful way to perform tasks such as recognizing text, replacing text within a string, and splitting up text into individual sections based on one or more complex delimiters.

Despite the complexity of regular expression patterns, the regular expression classes in the FCL are easy to use in your applications. Executing a regular expression consists of the following steps:

  1. Create an instance of a Regex object that contains the regular expression pattern along with any options for executing that pattern.

  2. Retrieve a reference to an instance of a Match object by calling the Match instance method if you want only the first match found. Or, retrieve a reference to an instance of the MatchCollection object by calling the Matches instance method if you want more than just the first match found. If, however, you want to know only whether the input string was a match and do not need the extra details on the nature of the match, you can use the Regex.IsMatch method.

  3. If you’ve called the Matches method to retrieve a MatchCollection object, iterate over the MatchCollection using a foreach loop. Each iteration gives you access to one of the Match objects that the regular expression produced.
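The steps above can be sketched as follows. This is a minimal illustration rather than code from the recipes in this chapter; the class name RegexStepsSketch, the method name FindNumbers, and the \d+ pattern are assumptions chosen only for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexStepsSketch
{
    // Returns every run of digits found in the input, in order.
    public static List<string> FindNumbers(string input)
    {
        // Step 1: create the Regex with its pattern and options.
        Regex re = new Regex(@"\d+", RegexOptions.None);

        List<string> found = new List<string>();

        // IsMatch is enough when you need only a yes/no answer.
        if (!re.IsMatch(input))
        {
            return found;
        }

        // Step 2: Match returns only the first match...
        Match first = re.Match(input);
        Console.WriteLine($"First match: {first.Value} at index {first.Index}");

        // ...while Matches returns all of them.
        MatchCollection all = re.Matches(input);

        // Step 3: iterate over the MatchCollection, one Match per iteration.
        foreach (Match m in all)
        {
            found.Add(m.Value);
        }

        return found;
    }
}
```

Calling RegexStepsSketch.FindNumbers with a string that contains no digits returns an empty list, because the IsMatch check short-circuits before any Match objects are created.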

7.1 Extracting Groups from a MatchCollection

Problem

You have a regular expression that contains one or more named groups (also known as named capture groups), such as the following:

\\\\(?<TheServer>\w*)\\(?<TheService>\w*)\\

where the named group TheServer will match any server name within a UNC string, and TheService will match any service name within a UNC string.

Note

This pattern does not match the UNCW format.

You need to store the groups that are returned by this regular expression in a keyed collection (such as a Dictionary<string, Group>) in which the key is the group name.

Solution

The ExtractGroupings method shown in Example 7-1 obtains a set of Group objects keyed by their matching group name.

Example 7-1. ExtractGroupings method
using System;
using System.Collections;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static List<Dictionary<string, Group>> ExtractGroupings(string source,
                                                           string matchPattern,
                                                           bool wantInitialMatch)
{
    List<Dictionary<string, Group>> keyedMatches =
        new List<Dictionary<string, Group>>();
    int startingElement = 1;
    if (wantInitialMatch)
    {
        startingElement = 0;
    }

    Regex RE = new Regex(matchPattern, RegexOptions.Multiline);
    MatchCollection theMatches = RE.Matches(source);

    foreach(Match m in theMatches)
    {
        Dictionary<string, Group> groupings = new Dictionary<string, Group>();

        for (int counter = startingElement; counter < m.Groups.Count; counter++)
        {
            // If we had just returned the MatchCollection directly, the
            // GroupNameFromNumber method would not be available to use.
            groupings.Add(RE.GroupNameFromNumber(counter), m.Groups[counter]);
        }
        keyedMatches.Add(groupings);
    }
    return (keyedMatches);
}

The ExtractGroupings method can be used in the following manner to extract named groups and organize them by name:

public static void TestExtractGroupings()
{
    string source = @"Path = ""\\MyServer\MyService\MyPath;
                              \\MyServer2\MyService2\MyPath2""";
    string matchPattern = @"\\\\(?<TheServer>\w*)\\(?<TheService>\w*)\\";

    foreach (Dictionary<string, Group> grouping in
             ExtractGroupings(source, matchPattern, true))
    {
        foreach (KeyValuePair<string, Group> kvp in grouping)
            Console.WriteLine($"Key / Value = {kvp.Key} / {kvp.Value}");
        Console.WriteLine("");
    }
}

This test method creates a source string and a regular expression pattern in the matchPattern variable. The two named groupings in this regular expression, (?<TheServer>\w*) and (?<TheService>\w*), are shown here:

string matchPattern = @"\\\\(?<TheServer>\w*)\\(?<TheService>\w*)\\";

The names for these two groups are TheServer and TheService. Text that matches either of these groupings can be accessed through these group names.

The source and matchPattern variables are passed in to the ExtractGroupings method, along with a Boolean value, which is discussed shortly. This method returns a List<T> containing Dictionary<string,Group> objects. These Dictionary<string,Group> objects contain the matches for each of the named groups in the regular expression, keyed by their group name.

This test method, TestExtractGroupings, returns the following:

Key / Value = 0 / \\MyServer\MyService\
Key / Value = TheService / MyService
Key / Value = TheServer / MyServer

Key / Value = 0 / \\MyServer2\MyService2\
Key / Value = TheService / MyService2
Key / Value = TheServer / MyServer2

If the last parameter to the ExtractGroupings method were to be changed to false, the following output would result:

Key / Value = TheService / MyService
Key / Value = TheServer / MyServer

Key / Value = TheService / MyService2
Key / Value = TheServer / MyServer2

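The only difference between these two outputs is that the first grouping (group 0) is not displayed when the last parameter to ExtractGroupings is changed to false. The first grouping is always the complete match of the regular expression. Note that a Dictionary<string, Group> does not guarantee the order in which its entries are enumerated, so the lines within each block may appear in a different order.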

Discussion

Groups within a regular expression can be defined in one of two ways. The first way is to add parentheses around the subpattern that you wish to define as a grouping. This type of grouping is sometimes labeled as unnamed. After running the regular expression, you can easily extract the text captured by this grouping from each returned Match object. The regular expression for this recipe could be modified, as follows, to use simple unnamed groups:

string matchPattern = @"\\\\(\w*)\\(\w*)\\";

After running the regular expression, you can access these groups using a numeric integer value starting with 1.

The second way to define a group within a regular expression is to use one or more named groups. You define a named group by adding parentheses around the subpattern that you wish to define as a grouping and adding a name to each grouping, using the following syntax:

(?<Name>\w*)

The Name portion of this syntax is the name you specify for this group. After executing this regular expression, you can access this group by the name Name.

To access each group, you must first use a loop to iterate over each Match object in the MatchCollection. For each Match object, you access the GroupCollection’s indexer through the Groups property, using the following unnamed syntax:

string group1 = m.Groups[1].Value;
string group2 = m.Groups[2].Value;

or the following named syntax, where m is the Match object:

string group1 = m.Groups["Group1_Name"].Value;
string group2 = m.Groups["Group2_Name"].Value;

If the Match method was used to return a single Match object instead of the MatchCollection, use the following syntax to access each group:

// Unnamed syntax
string group1 = theMatch.Groups[1].Value;
string group2 = theMatch.Groups[2].Value;

// Named syntax
string group1 = theMatch.Groups["Group1_Name"].Value;
string group2 = theMatch.Groups["Group2_Name"].Value;

where theMatch is the Match object returned by the Match method.
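As a small sketch of both access styles working together, the following method pulls the server and service names out of a UNC path. The class name GroupAccessSketch and the method name GetServerAndService are hypothetical; the pattern is the one used in this recipe, and GroupNumberFromName is used only to show that a named group can also be reached by number.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupAccessSketch
{
    // Returns { server, service } from a UNC path, or null if the path does not match.
    public static string[] GetServerAndService(string uncPath)
    {
        Regex re = new Regex(@"\\\\(?<TheServer>\w*)\\(?<TheService>\w*)\\");
        Match theMatch = re.Match(uncPath);
        if (!theMatch.Success)
        {
            return null;
        }

        // Named syntax.
        string server = theMatch.Groups["TheServer"].Value;

        // Named groups are numbered as well, so the numeric indexer
        // reaches the same group once its number is looked up.
        int serviceNumber = re.GroupNumberFromName("TheService");
        string service = theMatch.Groups[serviceNumber].Value;

        return new string[] { server, service };
    }
}
```

Checking theMatch.Success first matters: when no match is found, the Groups indexer still returns Group objects, but their Value is an empty string rather than useful data.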

See Also

The “.NET Framework Regular Expressions” and “Dictionary Class” topics in the MSDN documentation.

7.2 Verifying the Syntax of a Regular Expression

Problem

You have constructed a regular expression dynamically, either from your code or based on user input. You need to test the validity of this regular expression’s syntax before you actually use it.

Solution

Use the VerifyRegEx method shown in Example 7-2 to test the validity of a regular expression’s syntax.

Example 7-2. VerifyRegEx method
using System;
using System.Text.RegularExpressions;

public static bool VerifyRegEx(string testPattern)
{
    bool isValid = true;
    if ((testPattern?.Length ?? 0) > 0)
    {
        try
        {
            Regex.Match("", testPattern);
        }
        catch (ArgumentException)
        {
            // BAD PATTERN: syntax error
            isValid = false;
        }
    }
    else
    {
        // BAD PATTERN: pattern is null or empty
        isValid = false;
    }

    return (isValid);
}

To use this method, pass it the regular expression that you wish to verify:

public static void TestUserInputRegEx(string regEx)
{
    if (VerifyRegEx(regEx))
        Console.WriteLine("This is a valid regular expression.");
    else
        Console.WriteLine("This is not a valid regular expression.");
}

Discussion

The VerifyRegEx method calls the static Regex.Match method, which is useful for running regular expressions on the fly against a string. The static Regex.Match method returns a single Match object. By using this static method to run a regular expression against a string (in this case, an empty string), you can determine whether the regular expression is invalid by watching for a thrown exception. The Regex.Match method will throw an ArgumentException if the regular expression is not syntactically correct. The Message property of this exception contains the reason the regular expression failed to run, and the ParamName property contains the regular expression passed to the Match method. Both of these properties are read-only.

Before testing the regular expression with the static Match method, VerifyRegEx tests the regular expression to see if it is null or blank. A null regular expression string causes an ArgumentNullException to be thrown when it is passed in to the Match method. On the other hand, if a blank regular expression is passed in to the Match method, no exception is thrown (as long as a valid string is also passed to the first parameter of the Match method).

While this recipe validates whether or not the regular expression syntax is correct, it does not look for poorly written expressions. One common case of a poorly written regular expression is one that relies on excessive backtracking. Excessive backtracking can cause the regular expression to take exponentially long to complete, making it appear as if the code executing the regular expression has frozen.

Note

For a thorough explanation of backtracking in regular expressions, read the MSDN topic “Backtracking” under the “.NET Framework Regular Expressions” parent topic.

In cases where regular expressions use backtracking, it is recommended that you use a timeout value to limit the time a regular expression has to complete. Use the following Regex constructor:

Regex (String, RegexOptions, TimeSpan)

where TimeSpan is the length of time within which the regular expression is allowed to execute:

Regex regex = new Regex(bkTrkPattern, RegexOptions.None,
                        TimeSpan.FromMilliseconds(1000));

You can then execute the regular expression within a try-catch block, using the RegexMatchTimeoutException to catch a poorly written regular expression that takes an unusually long time to execute.
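A minimal sketch of that try-catch arrangement follows. The class name RegexTimeoutSketch and the method name TryIsMatch are hypothetical, and returning null to signal a timeout is simply one possible convention chosen for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexTimeoutSketch
{
    // Returns true or false for a completed match attempt,
    // or null if the attempt exceeded the timeout.
    public static bool? TryIsMatch(string input, string pattern, TimeSpan timeout)
    {
        try
        {
            // The third constructor argument limits how long a single
            // matching operation may run.
            Regex regex = new Regex(pattern, RegexOptions.None, timeout);
            return regex.IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            // The pattern took too long on this input; treat it as suspect.
            return null;
        }
    }
}
```

A pattern such as ^(a+)+$ run against a long string of a characters followed by a ! is a commonly cited input that can trigger heavy backtracking. Whether it actually reaches the timeout depends on the version of the regular expression engine, so callers should be prepared for either a false or a null result.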

7.3 Augmenting the Basic String Replacement Function

Problem

You need to replace character patterns within the target string with a new string. However, in this case, each replacement operation has a unique set of conditions that must be satisfied in order to allow the replacement to occur.

Solution

Use the overloaded instance Replace method shown in Example 7-3, which accepts a MatchEvaluator delegate along with its other parameters. The MatchEvaluator delegate is a callback method that overrides the default behavior of the Replace method.

Example 7-3. Overloaded Replace method that accepts a MatchEvaluator delegate
using System;
using System.Text.RegularExpressions;

public static string MatchHandler(Match theMatch)
{
    // Handle all ControlID_ entries.
    if (theMatch.Value.StartsWith("ControlID_", StringComparison.Ordinal))
    {
        long controlValue = 0;

        // Obtain the numeric value of the Top attribute.
        Match topAttributeMatch = Regex.Match(theMatch.Value, @"Top=([-]*\d*)");
        if (topAttributeMatch.Success)
        {

            if (topAttributeMatch.Groups[1].Value.Trim().Equals(""))
            {
                // If blank, set to zero.
                return (theMatch.Value.Replace(
                        topAttributeMatch.Groups[0].Value.Trim(),
                        "Top=0"));
            }
            else if (topAttributeMatch.Groups[1].Value.Trim().StartsWith("-",
                                         StringComparison.Ordinal))
            {
                // If negative, or only a negative sign (syntax error),
                // set to zero.
                return (theMatch.Value.Replace(
                        topAttributeMatch.Groups[0].Value.Trim(), "Top=0"));
            }
            else
            {
                // We have a valid number.
                // Convert the matched string to a numeric value.
                controlValue = long.Parse(topAttributeMatch.Groups[1].Value,
                           System.Globalization.NumberStyles.Any);

                // If the Top attribute is out of the specified range,
                // set it to zero.
                if (controlValue < 0 || controlValue > 5000)
                {
                    return (theMatch.Value.Replace(
                            topAttributeMatch.Groups[0].Value.Trim(),
                            "Top=0"));
                }
            }
        }
    }

    return (theMatch.Value);
}

The ComplexReplace method, which passes this callback method to the instance Replace method, is shown here:

public static void ComplexReplace(string matchPattern, string source)
{
    MatchEvaluator replaceCallback = new MatchEvaluator(MatchHandler);
    Regex RE = new Regex(matchPattern, RegexOptions.Multiline);
    string newString = RE.Replace(source, replaceCallback);

    Console.WriteLine($"Replaced String = {newString}");
}

To use this callback method with the static Replace method, modify the previous ComplexReplace method as follows:

public static void ComplexReplace(string matchPattern, string source)
{
    MatchEvaluator replaceCallback = new MatchEvaluator(MatchHandler);
    string newString = Regex.Replace(source, matchPattern, replaceCallback);
    Console.WriteLine("Replaced String = " + newString);
}

where source is the original string to run the replace operation against, and matchPattern is the regular expression pattern to match in the source string.

If the ComplexReplace method is called from the following code:

public static void TestComplexReplace()
{
    string matchPattern = "(ControlID_.*)";
    string source = @"WindowID=Main
    ControlID_TextBox1 Top=-100 Left=0 Text=BLANK
    ControlID_Label1 Top=9999990 Left=0 Caption=Enter Name Here
    ControlID_Label2 Top= Left=0 Caption=Enter Name Here";

    ComplexReplace(matchPattern, source);
}

only the Top attributes of the ControlID_* lines are changed from their original values to 0.

The result of this replace action will change the Top attribute value of a ControlID_* line to 0 if it is blank, less than 0, or greater than 5,000. Any other tag that contains a Top attribute will remain unchanged. The following three lines of the source string will be changed from:

ControlID_TextBox1 Top=-100 Left=0 Text=BLANK
ControlID_Label1 Top=9999990 Left=0 Caption=Enter Name Here
ControlID_Label2 Top= Left=0 Caption=Enter Name Here

to:

ControlID_TextBox1 Top=0 Left=0 Text=BLANK
ControlID_Label1 Top=0 Left=0 Caption=Enter Name Here
ControlID_Label2 Top=0 Left=0 Caption=Enter Name Here

Discussion

The MatchEvaluator delegate, which is automatically invoked when it is supplied as a parameter to the Regex class’s Replace method, allows for custom replacement of each string that conforms to the regular expression pattern.

If the current Match object is operating on a ControlID_* line with a Top attribute that is out of the specified range, the code within the MatchHandler callback method returns a new modified string. Otherwise, the currently matched string is returned unchanged. This allows you to override the default Replace functionality by modifying only that part of the source string that meets certain criteria. The code within this callback method gives you some idea of what you can accomplish using this replacement technique.

To make use of this callback method, you need a way to call it from the ComplexReplace method. First, a variable of type System.Text.RegularExpressions.MatchEvaluator is created. This variable (replaceCallback) is the delegate that is used to call the MatchHandler method:

MatchEvaluator replaceCallback = new MatchEvaluator(MatchHandler);

Finally, the Replace method is called with the reference to the MatchEvaluator delegate passed in as a parameter:

string newString = Regex.Replace(source, matchPattern, replaceCallback);
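When the replacement logic is short, the delegate can also be supplied as a lambda expression instead of a separate named method. The sketch below is an illustration only: the class name EvaluatorLambdaSketch, the method name ClampNumbers, and the clamping rule are assumptions and are not part of the recipe's MatchHandler logic.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class EvaluatorLambdaSketch
{
    // Replaces every run of ASCII digits whose value exceeds max with max itself.
    public static string ClampNumbers(string source, long max)
    {
        // The lambda is converted to a MatchEvaluator and is invoked
        // once for each match; its return value is the replacement text.
        return Regex.Replace(source, "[0-9]+", m =>
        {
            long value;
            bool parsed = long.TryParse(m.Value, NumberStyles.None,
                                        CultureInfo.InvariantCulture, out value);

            // A digit run too large for a long is also treated as out of range.
            if (!parsed || value > max)
            {
                return max.ToString(CultureInfo.InvariantCulture);
            }

            // Returning the original text leaves this match unchanged.
            return m.Value;
        });
    }
}
```

As with MatchHandler, returning the matched text untouched is how the evaluator opts out of a replacement for a particular match.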

See Also

The “.NET Framework Regular Expressions” topic in the MSDN documentation.

7.4 Implementing a Better Tokenizer

Problem

You need a tokenizer—also referred to as a lexer—that can split up a string based on a well-defined set of characters.

Solution

With the Split method of the Regex class, you can create a regular expression to indicate the types of tokens and separators that you are interested in gathering. This technique works especially well with equations, since the tokens of an equation are well defined. For example, the code:

using System;
using System.Text.RegularExpressions;

public static string[] Tokenize(string equation)
{
    Regex re = new Regex(@"([\+\-\*\(\)\^\\])");
    return (re.Split(equation));
}

will divide up a string according to the regular expression specified in the Regex constructor. In other words, the string passed in to the Tokenize method will be divided up based on the delimiters +, -, *, (, ), ^, and \. The following method will call the Tokenize method to tokenize the equation (y - 3)*(3111*x^21 + x + 320):

public static void TestTokenize()
{
    foreach(string token in Tokenize("(y - 3)*(3111*x^21 + x + 320)"))
        Console.WriteLine("String token = " + token.Trim());
}

which displays the following output:

String token =
String token = (
String token = y
String token = -
String token = 3
String token = )
String token =
String token = *
String token =
String token = (
String token = 3111
String token = *
String token = x
String token = ^
String token = 21
String token = +
String token = x
String token = +
String token = 320
String token = )
String token =

Notice that each individual operator, parenthesis, and number has been broken out into its own separate token. The empty tokens appear wherever two delimiters are adjacent, or where a delimiter begins or ends the string, because Split also returns the (empty) text between them.

Discussion

In real-world projects, you do not always have the luxury of being able to control the set of inputs to your code. By making use of regular expressions, you can take the original tokenizer and make it flexible enough to allow it to be applied to many types or styles of input.

The key method used here is the Split instance method of the Regex class. The return value of this method is a string array with elements that include each individual token of the source string—the equation, in this case.

Note that the static Split method accepts RegexOptions enumeration values, while the instance method accepts a maximum number of substrings to return and a starting position for the search. (An instance can still use RegexOptions, but they must be supplied to the Regex constructor.) This may have some bearing on whether you choose the static or instance method.
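The difference between the two flavors can be sketched as follows. The class and method names here (SplitOverloadsSketch, SplitAtMost, SplitIgnoringCase) are hypothetical and exist only to show one instance overload and one static overload side by side.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitOverloadsSketch
{
    // Instance overload: limits the number of substrings returned.
    // The last substring holds the unsplit remainder of the input.
    public static string[] SplitAtMost(string input, string pattern, int maxParts)
    {
        Regex re = new Regex(pattern);
        return re.Split(input, maxParts);
    }

    // Static overload: takes RegexOptions directly.
    public static string[] SplitIgnoringCase(string input, string pattern)
    {
        return Regex.Split(input, pattern, RegexOptions.IgnoreCase);
    }
}
```

For example, SplitAtMost("a,b,c,d", ",", 2) yields the two elements "a" and "b,c,d", while SplitIgnoringCase("1x2X3", "x") splits on both the lowercase and uppercase letter and yields "1", "2", and "3".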

See Also

The “.NET Framework Regular Expressions” topic in the MSDN documentation.

7.5 Returning the Entire Line in Which a Match Is Found

Problem

You have a string or file that contains multiple lines. When a specific character pattern is found on a line, you want to return the entire line, not just the matched text.

Solution

Use the StreamReader.ReadLine method to obtain each line in a file to run a regular expression against, as shown in Example 7-4.

Example 7-4. Returning the entire line in which a match is found
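using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static List<string> GetLines(string source, string pattern, bool isFileName)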
{
    List<string> matchedLines = new List<string>();

    // If this is a file, get the entire file's text.
    if (isFileName)
    {
        using (FileStream FS = new FileStream(source, FileMode.Open,
               FileAccess.Read, FileShare.Read))
        {
            using (StreamReader SR = new StreamReader(FS))
            {
                Regex RE = new Regex(pattern, RegexOptions.Multiline);
                string text = "";
                while (text != null)
                {
                    text = SR.ReadLine();
                    if (text != null)
                    {
                        // Run the regex on each line in the string.
                        if (RE.IsMatch(text))
                        {
                            // Get the line if a match was found.
                            matchedLines.Add(text);
                        }
                    }
                }
            }
        }
    }
    else
    {
        // Run the regex once on the entire string.
        Regex RE = new Regex(pattern, RegexOptions.Multiline);
        MatchCollection theMatches = RE.Matches(source);

        // Use these vars to remember the last line added to matchedLines
        // so that we do not add duplicate lines.
        int lastLineStartPos = -1;
        int lastLineEndPos = -1;

        // Get the line for each match.
        foreach (Match m in theMatches)
        {
            int lineStartPos = GetBeginningOfLine(source, m.Index);
            int lineEndPos = GetEndOfLine(source, (m.Index + m.Length - 1));

            // If this is not a duplicate line, add it.
            if (lastLineStartPos != lineStartPos &&
                lastLineEndPos != lineEndPos)
            {
                string line = source.Substring(lineStartPos,
                                lineEndPos - lineStartPos);
                matchedLines.Add(line);

                // Reset line positions.
                lastLineStartPos = lineStartPos;
                lastLineEndPos = lineEndPos;
            }
        }
    }
    return (matchedLines);
}

public static int GetBeginningOfLine(string text, int startPointOfMatch)
{
    if (startPointOfMatch > 0)
    {
        --startPointOfMatch;
    }

    if (startPointOfMatch >= 0 && startPointOfMatch < text?.Length)
    {
        // Move to the left until the first '\n' char is found.
        for (int index = startPointOfMatch; index >= 0; index--)
        {
            if (text?[index] == '\n')
            {
                return (index + 1);
            }
        }

        return (0);
    }

    return (startPointOfMatch);
}

public static int GetEndOfLine(string text, int endPointOfMatch)
{
    if (endPointOfMatch >= 0 && endPointOfMatch < text?.Length)
    {
        // Move to the right until the first '\n' char is found.
        for (int index = endPointOfMatch; index < text.Length; index++)
        {
            if (text?[index] == '\n')
            {
                return (index);
            }
        }

        return (text.Length);
    }

    return (endPointOfMatch);
}

The following method shows how to call the GetLines method with either a filename or a string:

public static void TestGetLine()
{
    // Get each line within the file TestFile.txt as a separate string.
    Console.WriteLine();
    List<string> lines = GetLines(@"C:\TestFile.txt", "Line", true);
    foreach (string s in lines)
        Console.WriteLine($"MatchedLine: {s}");

    // Get the lines matching the text "Line" within the given string.
    Console.WriteLine();
    lines = GetLines("Line1\nLine2\nLine3\nLine4", "Line", false);
    foreach (string s in lines)
        Console.WriteLine($"MatchedLine: {s}");
}

Discussion

The GetLines method accepts three parameters:

source
The string or filename in which to search for a pattern.
pattern
The regular expression pattern to apply to the source string.
isFileName
Pass in true if source is a filename, or false if source is a string.

This method returns a List<string> of strings that contains each line in which the regular expression match was found.

The GetLines method can obtain the lines on which matches occur within a string or a file. When a regular expression is run against a file whose name is passed in to the source parameter (when isFileName equals true) in the GetLines method, the file is opened and read line by line. The regular expression is run against each line, and if a match is found, that line is stored in the matchedLines List<string>. Using the ReadLine method of the StreamReader object saves you from having to determine where each line starts and ends. Determining where a line starts and ends in a string requires some work, as you will see.

Running the regular expression against a string passed in to the source parameter (when isFileName equals false) in the GetLines method produces a MatchCollection. Each Match object in this collection is used to obtain the line on which it is located in the source string. We obtain the line by starting at the position of the first character of the match in the source string and moving one character at a time to the left until either a '\n' character or the beginning of the source string is found (this code is found in the GetBeginningOfLine method). This gives you the beginning of the line, which is placed in the variable lineStartPos. Next, we find the end of the line by starting at the last character of the match in the source string and moving to the right until either a '\n' character or the end of the source string is found (this code is found in the GetEndOfLine method). This ending position is placed in the lineEndPos variable. All of the text between lineStartPos and lineEndPos will be the line in which the match is found. Each of these lines is added to the matchedLines List<string> and returned to the caller.

Something interesting you can do with the GetLines method is to pass in the string "\n" in the pattern parameter of this method. This trick will effectively return each line of the string or file as a string in the List<string>. While this will work with strings that already have the CRLF characters embedded in them, it will not work on text returned from a file. The reason is that the ReadLine method in the preceding GetLines method will strip off the CRLF characters. To fix this we can simply add these characters back in, as we are performing the match in the GetLines method:

// It is necessary to add CRLF chars
// since Readline() strips off these chars
if (RE.IsMatch(text + Environment.NewLine))

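Finally, note that if more than one match is found on a line, that line is added to the List<string> only once: the file branch calls IsMatch a single time per line, and the string branch skips a line that it has just added.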

Warning

Take care when adding line break characters back into the text. If you are using and processing this text exclusively on Windows systems, you won’t have any issues. However, if you are using other systems, or a mix of systems, you need to make sure you add the correct line break characters; that is, for UNIX and OS X, use only the linefeed character (\n).
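One way to sidestep line-break differences when the source is a string is to read it line by line with a StringReader, whose ReadLine method accepts \n, \r, and \r\n as line terminators. This is a different technique from the index-scanning approach in Example 7-4, offered only as a sketch; the class name LineReaderSketch and the method name GetLinesFromString are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class LineReaderSketch
{
    // Returns each line of source that contains a match for pattern.
    public static List<string> GetLinesFromString(string source, string pattern)
    {
        List<string> matchedLines = new List<string>();
        Regex re = new Regex(pattern);

        using (StringReader reader = new StringReader(source))
        {
            string line;

            // ReadLine returns null at the end of the string and strips
            // the line terminator, whichever form it takes.
            while ((line = reader.ReadLine()) != null)
            {
                if (re.IsMatch(line))
                {
                    matchedLines.Add(line);
                }
            }
        }

        return matchedLines;
    }
}
```

The trade-off is that, as with the file branch of GetLines, the pattern is applied to one line at a time, so a pattern that depends on the line break characters themselves will not match.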

See Also

The “.NET Framework Regular Expressions,” “FileStream Class,” and “StreamReader Class” topics in the MSDN documentation.

7.6 Finding a Particular Occurrence of a Match

Problem

You need to find a specific occurrence of a match within a string. For example, you want to find the third occurrence of a word or the second occurrence of a Social Security number. In addition, you may need to find every third occurrence of a word in a string.

Solution

To find a particular occurrence of a match in a string, simply index into the MatchCollection returned from Regex.Matches:

public static Match FindOccurrenceOf(string source, string pattern,
                                     int occurrence)
{
    if (occurrence < 1)
    {
        throw (new ArgumentException("Cannot be less than 1",
                                     nameof(occurrence)));
    }

    // Make occurrence zero-based.
    --occurrence;

    // Run the regex once on the source string.
    Regex RE = new Regex(pattern, RegexOptions.Multiline);
    MatchCollection theMatches = RE.Matches(source);

    if (occurrence >= theMatches.Count)
    {
        return (null);
    }
    else
    {
        return (theMatches[occurrence]);
    }
}

To find every nth occurrence of a match in a string (for example, every third occurrence), build a List<Match> on the fly:

public static List<Match> FindEachOccurrenceOf(string source, string pattern,
                                               int occurrence)
{
    if (occurrence < 1)
    {
        throw (new ArgumentException("Cannot be less than 1",
                                     nameof(occurrence)));
    }

    List<Match> occurrences = new List<Match>();

    // Run the regex once on the source string.
    Regex RE = new Regex(pattern, RegexOptions.Multiline);
    MatchCollection theMatches = RE.Matches(source);

    for (int index = (occurrence - 1); index < theMatches.Count;
         index += occurrence)
    {
        occurrences.Add(theMatches[index]);
    }

    return (occurrences);
}

The following method shows how to invoke the two previous methods:

public static void TestOccurrencesOf()
{
    Match matchResult = FindOccurrenceOf
                        ("one two three one two three one two three one"
                         + " two three one two three one two three", "two", 2);
    Console.WriteLine($"{matchResult?.ToString()}\t{matchResult?.Index}");

    Console.WriteLine();
    List<Match> results = FindEachOccurrenceOf
                          ("one one two three one two three one "
                           + " two three one two three", "one", 2);
    foreach (Match m in results)
        Console.WriteLine($"{m.ToString()}\t{m.Index}");
}

Discussion

This recipe contains two similar but distinct methods. The first method, FindOccurrenceOf, returns a particular occurrence of a regular expression match. The occurrence you want to find is passed in to this method via the occurrence parameter. If the particular occurrence of the match does not exist—for example, you ask to find the second occurrence, but only one occurrence exists—a null is returned from this method. Because of this, you should check that the returned object of this method is not null before using that object. If the particular occurrence exists, the Match object that holds the match information for that occurrence is returned.

The second method in this recipe, FindEachOccurrenceOf, works similarly to the FindOccurrenceOf method, except that it continues to find a particular occurrence of a regular expression match until the end of the string is reached. For example, if you ask to find the second occurrence, this method would return a List<Match> of zero or more Match objects. The Match objects would correspond to the second, fourth, sixth, and eighth occurrences of a match and so on until the end of the string is reached.
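As an alternative sketch, a single occurrence can also be located by walking the matches one at a time with Match.NextMatch, which stops as soon as the requested occurrence is reached. The class name NextMatchSketch and the method name FindOccurrence are hypothetical, and the choice of ArgumentOutOfRangeException is an assumption of this example rather than the recipe's behavior.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NextMatchSketch
{
    // Returns the nth match (1-based) of pattern in source, or null if
    // there are fewer than n matches.
    public static Match FindOccurrence(string source, string pattern, int occurrence)
    {
        if (occurrence < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(occurrence),
                                                  "Cannot be less than 1");
        }

        // Start with the first match, then advance occurrence - 1 times.
        Match m = Regex.Match(source, pattern, RegexOptions.Multiline);
        for (int i = 1; m.Success && i < occurrence; i++)
        {
            m = m.NextMatch();
        }

        // An unsuccessful Match means the source ran out of matches.
        return m.Success ? m : null;
    }
}
```

For the input "one two three one two three", asking for the second occurrence of "two" returns the match at index 18, and asking for a third occurrence returns null.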

See Also

The “.NET Framework Regular Expressions” and “ArrayList Class” topics in the MSDN documentation.

7.7 Using Common Patterns

Problem

You need a quick list from which to choose regular expression patterns that match standard items. These standard items could be a Social Security number, a zip code, a word containing only characters, an alphanumeric word, an email address, a URL, dates, or one of many other possible items used throughout business applications.

These patterns can be useful in making sure that a user has input the correct data and that it is well formed. These patterns can also be used as an extra security measure to keep hackers from attempting to break your code by entering strange or malformed input (e.g., SQL injection or cross-site-scripting attacks). Note that these regular expressions are not a silver bullet that will stop all attacks on your system; rather, they are an added layer of defense.

Solution

  • Match only alphanumeric characters along with the characters -, +, ., and any whitespace:

    ^([\w.+-]|\s)*$

    Note

    Be careful using the - (hyphen) character within a character class (that is, a set of characters enclosed within [ and ]). That character is also used to specify a range of characters, as in a-z for “a through z inclusive.” If you want to use a literal - character, either escape it with \ or put it at the end of the character class, as shown in the next examples.

  • Match only alphanumeric characters along with the characters -, +, ., and any whitespace, with the stipulation that there is at least one of these characters and no more than 10 of these characters:

    ^([\w.+-]|\s){1,10}$
  • Match a person’s name, up to 55 characters:

    ^[a-zA-Z'\-\s]{1,55}$
  • Match a positive or negative integer:

    ^(\+|-)?\d+$
  • Match a positive or negative floating-point number only; this pattern does not match integers:

    ^(\+|-)?(\d*\.\d+)$

    Match a floating-point or integer number that can have a positive or negative value:

    ^(\+|-)?(\d*\.)?\d+$
  • Match a date in the form ##/##/####, where the day and month can be a one- or two-digit value and the year can only be a four-digit value:

    ^\d{1,2}/\d{1,2}/\d{4}$
  • Verify if the input is a Social Security number of the form ###-##-####:

    ^\d{3}-\d{2}-\d{4}$
  • Match an IPv4 address:

    ^([0-2]?[0-9]?[0-9]\.){3}[0-2]?[0-9]?[0-9]$
  • Verify that an email address is in the form name@address where address is not an IP address:

    ^[A-Za-z0-9_\-.]+@(([A-Za-z0-9-])+\.)+([A-Za-z-])+$
  • Verify that an email address is in the form name@address where address is an IP address:

    ^[A-Za-z0-9_\-.]+@([0-2]?[0-9]?[0-9]\.){3}[0-2]?[0-9]?[0-9]$
  • Match or verify a URL that uses either the HTTP, HTTPS, or FTP protocol. Note that this regular expression will not match relative URLs:

    ^(http|https|ftp)://[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-._?,'/\\+&%$#=~])*$
  • Match only a dollar amount with the optional $ and + or - preceding characters (note that any number of decimal places may be added):

    ^\$?[+-]?[\d,]*(\.\d*)?$

    This is similar to the previous regular expression, except that no more than two decimal places are allowed:

    ^\$?[+-]?[\d,]*\.?\d{0,2}$
  • Match a credit card number to be entered as four sets of four digits separated with a space, -, or no character at all:

    ^((\d{4}[- ]?){3}\d{4})$
  • Match a zip code to be entered as five digits with an optional four-digit extension:

    ^\d{5}(-\d{4})?$
  • Match a North American phone number with an optional area code and an optional - character to be used in the phone number and no extension:

    ^(\(?[0-9]{3}\)?)?-?[0-9]{3}-?[0-9]{4}$
  • Match a phone number similar to the previous regular expression but allow an optional five-digit extension prefixed with either ext or extension:

    ^(\(?[0-9]{3}\)?)?-?[0-9]{3}-?[0-9]{4}(\s*ext(ension)?[0-9]{5})?$
  • Match a full path beginning with the drive letter and optionally match a filename with an extension of up to three characters (note that no .. characters signifying to move up the directory hierarchy are allowed, nor is a directory name with a . followed by an extension):

    ^[a-zA-Z]:[\\/]([_a-zA-Z0-9]+[\\/]?)*([_a-zA-Z0-9]+\.[_a-zA-Z0-9]{0,3})?$
  • Verify that a password follows some specific rules (i.e., the password is between 6 and 25 characters in length and contains at least one digit, one lowercase letter, and one uppercase letter):

    ^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{6,25}$
  • Determine if any malicious characters were input by the user. Note that this regular expression will not prevent all malicious input, and it also prevents some valid input, such as last names that contain a single quote:

    ^([^)(<>"'\%&+;][(-{2})])*$
  • Extract a tag from an XHTML, HTML, or XML string. This regular expression will return the beginning tag and ending tag, including any attributes of the tag.

    Note that you will need to replace TAGNAME with the real tag name you want to search for:

    <TAGNAME.*?>(.*?)</TAGNAME>
  • Extract a comment line from code. The following regular expression extracts HTML comments from a web page. This can be useful in determining if any HTML comments that are leaking sensitive information need to be removed from your code base before it goes into production:

    <!--.*?-->
  • Match a C# single-line comment:

    //.*$
  • Match a C# multiline comment:

    /\*.*?\*/
Note

While the four aforementioned regular expressions are great for finding tags and comments, they are not foolproof. To accurately find all tags and comments, you need to use a full parser for the language you are targeting.
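
The tag and comment patterns depend on the options passed to the regular expression engine. The following is a minimal sketch, not a definitive implementation, of how they might be applied. The class and method names (MarkupPatternsDemo, FindHtmlComments, and so on) are hypothetical and exist only for this illustration. The sketch assumes that RegexOptions.Singleline is wanted so that . can cross line breaks inside tags and block comments, and that RegexOptions.Multiline is wanted so that $ matches at the end of each line for single-line comments:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class MarkupPatternsDemo
{
    // Collects the text of the given capture group (0 = whole match) from every match.
    private static string[] Values(MatchCollection matches, int group = 0) =>
        matches.Cast<Match>().Select(m => m.Groups[group].Value).ToArray();

    // Singleline makes . match newlines, so comments that span lines are found.
    public static string[] FindHtmlComments(string markup) =>
        Values(Regex.Matches(markup, @"<!--.*?-->", RegexOptions.Singleline));

    // Multiline makes $ match at the end of each line, not only at the end of the input.
    public static string[] FindSingleLineComments(string code) =>
        Values(Regex.Matches(code, @"//.*$", RegexOptions.Multiline));

    public static string[] FindBlockComments(string code) =>
        Values(Regex.Matches(code, @"/\*.*?\*/", RegexOptions.Singleline));

    // Returns the text between each <tagName ...> and </tagName> pair (capture group 1).
    public static string[] FindTagContents(string markup, string tagName)
    {
        string name = Regex.Escape(tagName);
        string pattern = $"<{name}.*?>(.*?)</{name}>";
        return Values(Regex.Matches(markup, pattern, RegexOptions.Singleline), 1);
    }

    public static void Main()
    {
        string page = "<html><!-- TODO: remove\n test account --><title>Home</title></html>";
        string code = "int x = 1; // counter\n/* first line\n   second line */\nint y = 2;";

        Console.WriteLine(string.Join(" | ", FindHtmlComments(page)));
        Console.WriteLine(string.Join(" | ", FindTagContents(page, "title")));
        Console.WriteLine(string.Join(" | ", FindSingleLineComments(code)));
        Console.WriteLine(string.Join(" | ", FindBlockComments(code)));
    }
}
```

The sample input uses \n line endings. With \r\n line endings, the single-line comment pattern would also capture the trailing \r, because . excludes only \n. The patterns also match comment markers that appear inside string literals, which is one reason a full parser is the more accurate tool.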

Discussion

Regular expressions are effective at finding specific information, and they have a wide range of uses. Many applications use them to locate specific information within a larger range of text, as well as to filter out bad input. The filtering action is very useful in tightening the security of an application and preventing an attacker from attempting to use carefully formed input to gain access to a machine on the Internet or a local network. By using a regular expression to allow only good input to be passed to the application, you can reduce the likelihood of many types of attacks, such as SQL injection or cross-site scripting.
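
As one way to picture this filtering, the following sketch wraps three of the patterns from the Solution in a small validation helper. The InputValidator class and its method names are hypothetical, introduced only for this illustration. The sketch rejects anything that does not match the expected format, which is only one layer of defense and does not replace parameterized queries or output encoding:

```csharp
using System;
using System.Text.RegularExpressions;

public static class InputValidator
{
    // Patterns taken from the Solution; each is anchored so the whole input must match.
    private static readonly Regex ZipCode = new Regex(@"^\d{5}(-\d{4})?$");
    private static readonly Regex SignedInteger = new Regex(@"^(\+|-)?\d+$");
    private static readonly Regex SocialSecurityNumber = new Regex(@"^\d{3}-\d{2}-\d{4}$");

    // A null input is treated as invalid rather than passed to the Regex.
    public static bool IsZipCode(string input) =>
        input != null && ZipCode.IsMatch(input);

    public static bool IsSignedInteger(string input) =>
        input != null && SignedInteger.IsMatch(input);

    public static bool IsSocialSecurityNumber(string input) =>
        input != null && SocialSecurityNumber.IsMatch(input);

    public static void Main()
    {
        string[] quantities = { "-42", "17", "1; DROP TABLE Users" };
        foreach (string quantity in quantities)
            Console.WriteLine($"{quantity}\t{IsSignedInteger(quantity)}");

        Console.WriteLine($"12345-6789\t{IsZipCode("12345-6789")}");
        Console.WriteLine($"123-45-6789\t{IsSocialSecurityNumber("123-45-6789")}");
    }
}
```

Keep in mind that in .NET the $ anchor also matches just before a final newline, so an input such as "12345\n" passes the zip code check. If that matters for your application, \z is the stricter end-of-string anchor.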

The regular expressions presented in this recipe provide only a small cross-section of what you can accomplish with them. You can easily modify these expressions to suit your needs. Take, for example, the following expression, which allows only between 1 and 10 alphanumeric characters, along with a few symbols, as input:

^([\w.+-]|\s){1,10}$

By changing the {1,10} part of the regular expression to {0,200}, you can make this expression match a blank entry or an entry of the specified symbols up to and including 200 characters.
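
The effect of changing the quantifier can be checked with a short sketch such as the following. The QuantifierDemo class and its members are hypothetical names used only for this illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Original pattern: between 1 and 10 allowed characters.
    public static readonly Regex Strict = new Regex(@"^([\w.+-]|\s){1,10}$");

    // Modified pattern: a blank entry or up to 200 allowed characters.
    public static readonly Regex Relaxed = new Regex(@"^([\w.+-]|\s){0,200}$");

    public static void Main()
    {
        // "hello world" is 11 characters, one more than Strict allows.
        string[] inputs = { "", "hello", "hello world", "bad<input>" };
        foreach (string input in inputs)
        {
            Console.WriteLine(
                $"\"{input}\"\tstrict: {Strict.IsMatch(input)}\trelaxed: {Relaxed.IsMatch(input)}");
        }
    }
}
```

Only the length limits differ between the two patterns. Input containing a character outside the allowed set, such as < or >, is rejected by both.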

Note the use of the ^ character at the beginning of the expression and the $ character at the end of the expression. These characters start the match at the beginning of the text and match all the way to the end of the text. Adding these characters forces the regular expression to match the entire string or none of it. By removing these characters, you can search for specific text within a larger block of text. For example, the following regular expression matches only a string containing nothing but a US zip code (there can be no leading or trailing spaces):

^\d{5}(-\d{4})?$

This version also accepts a zip code that has leading or trailing whitespace (notice the addition of the \s* to the beginning and ending of the expression):

^\s*\d{5}(-\d{4})?\s*$

However, this modified expression matches a zip code found anywhere within a string (including a string containing just a zip code):

\d{5}(-\d{4})?
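
The difference between the three zip code variations can be seen in a sketch like the following. The ZipCodeAnchoringDemo class and its method names are hypothetical, chosen only for this illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ZipCodeAnchoringDemo
{
    // Anchored: the entire input must be a zip code.
    public static bool IsExactZip(string input) =>
        Regex.IsMatch(input, @"^\d{5}(-\d{4})?$");

    // Anchored, but whitespace is tolerated on either side.
    public static bool IsPaddedZip(string input) =>
        Regex.IsMatch(input, @"^\s*\d{5}(-\d{4})?\s*$");

    // Unanchored: returns the first zip code found anywhere in the text, or null if none.
    public static string FindFirstZip(string text)
    {
        Match match = Regex.Match(text, @"\d{5}(-\d{4})?");
        return match.Success ? match.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(IsExactZip("12345"));
        Console.WriteLine(IsExactZip(" 12345"));
        Console.WriteLine(IsPaddedZip("  12345-6789 "));
        Console.WriteLine(FindFirstZip("Ship to Boston, MA 02134 today"));
    }
}
```

Because the unanchored pattern matches any run of five digits, it also matches the first five digits of a longer number such as 1234567. Surrounding the pattern with \b word boundaries is one possible refinement if that behavior is unwanted.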

Use the regular expressions in this recipe and modify them to suit your needs.

See Also

Introducing Regular Expressions by Michael Fitzgerald and Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan (both O’Reilly).
