Chapter 21
Regular Expressions

What’s in This Chapter

  • Regular expression syntax
  • Using regular expressions to detect matches, find matches, and make replacements
  • Using regular expressions to parse input

Wrox.com Downloads for This Chapter

Please note that all the code examples for this chapter are available as a part of this chapter’s code download on the book’s website at www.wrox.com/go/csharp5programmersref on the Download Code tab.

Many applications enable the user to type information but the information should match some sort of pattern. For example, the string 784-369 is not a valid phone number and Rod@Stephens@C#Helper.com is not a valid e-mail address.

One approach for validating this kind of input is to use string methods. You could use the string class’s IndexOf, LastIndexOf, Substring, and other methods to break the input apart and see if the pieces make sense. For all but the simplest situations, however, that would be a huge amount of work.

Regular expressions provide another method for verifying that the user’s input matches a pattern. A regular expression is a string that contains characters that define a pattern. For example, the regular expression ^\d{3}-\d{4}$ represents a pattern that matches three digits followed by a hyphen followed by four more digits, as in 123-4567. (This isn’t a great pattern for matching U.S. phone numbers because it allows many invalid combinations such as 111-1111 and 000-0000.)

The .NET Framework includes classes that can use regular expressions to see if an input string matches the pattern. They also provide methods for locating patterns within input text and for making complex substitutions.

This chapter provides an introduction to regular expressions. It explains how to create regular expressions and how to use them to see if a complete string matches a pattern, find matches within a string, use patterns to make replacements, and parse inputs.

The following section explains the regular expression syntax used by .NET. The sections after that one explain how to determine whether a string matches a pattern, find matches within a string, and make replacements.

Building Regular Expressions

Before you can write code to see if an input string matches a pattern, you need to know how to build a regular expression to represent that pattern.

A regular expression can contain literal characters that the input must match exactly and characters that have special meanings. For example, the sequence [0-9] means the input must match a single digit 0 through 9.

Regular expressions can also contain special character sequences called escape sequences that match specific patterns or that control the behavior of a regular expression class. For example, the escape sequence \d makes the pattern match a single digit just as [0-9] does.

If you want to include a special character such as \ or [ in a regular expression without it taking on its special meaning, you can “escape it” as in \\ or \[.
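As a minimal sketch of escaping in C# (the class and method names here are hypothetical and not part of the chapter's download), the following code shows two common approaches. A verbatim string, marked with @, passes backslashes to the regular expression engine exactly as typed, and the Regex.Escape method adds the backslashes needed to treat a piece of text literally.

```csharp
using System.Text.RegularExpressions;

public static class EscapeSketch
{
    // Returns true if input contains the literal text, treating every
    // character in literal as an ordinary character.
    public static bool ContainsLiteral(string input, string literal)
    {
        // Regex.Escape inserts backslashes before special characters such as [ and \.
        string pattern = Regex.Escape(literal);
        return Regex.IsMatch(input, pattern);
    }

    // Returns true if input contains a literal [ character.
    public static bool ContainsOpenBracket(string input)
    {
        // The @ prefix makes a verbatim string, so \[ reaches Regex unchanged.
        return Regex.IsMatch(input, @"\[");
    }
}
```

Without the call to Regex.Escape, the text [0] would be treated as a character class that matches the single character 0.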

The tools you use to build regular expressions can be divided into six categories: character escapes, character classes, anchors, grouping constructs, quantifiers, and alternation constructs. The following sections describe those categories. The section after that describes some example regular expressions that match common input patterns such as telephone numbers.

Character Escapes

A character escape matches special characters such as [Tab] that you cannot simply type into a string. The following table lists the most useful character escapes.

Escape    Meaning
\t        Matches the tab character
\r        Matches the return character
\n        Matches the newline character
\nnn      Matches a character with ASCII code given by the two or three octal digits nnn
\xnn      Matches a character with ASCII code given by the two hexadecimal digits nn
\unnnn    Matches a character with Unicode representation given by the four hexadecimal digits nnnn

For example, the regular expression \u00A7 matches the section symbol § (because 00A7 is that character’s hexadecimal Unicode value).

Character Classes

A character class matches one of the items in a set of characters. For example, \d matches a digit 0 through 9. The following table lists the most useful character class constructs.

Construct       Meaning
[chars]         Matches one of the characters inside the brackets. For example, [aeiou] matches a single lowercase vowel.
[^chars]        Matches a character that is not inside the brackets. For example, [^aeiouAEIOU] matches a single nonvowel character such as Q, ?, or 3.
[first-last]    Matches a character between the character first and the character last. For example, [a-z] matches any lowercase letter between a and z. You can combine multiple ranges as in [a-zA-Z], which matches uppercase or lowercase letters.
.               This is a wildcard that matches any single character except \n. (To match a period, use the \. escape sequence.)
\w              Matches a single “word” character. Normally, this is equivalent to [a-zA-Z_0-9], so it matches letters, the underscore character, and digits.
\W              Matches a single nonword character. Normally, this is equivalent to [^a-zA-Z_0-9].
\s              Matches a single whitespace character. Normally, this includes [Space], [Form feed], [Newline], [Return], [Tab], and [Vertical tab].
\S              Matches a single nonwhitespace character. Normally, this matches everything except [Space], [Form feed], [Newline], [Return], [Tab], and [Vertical tab].
\d              Matches a single decimal digit. Normally, this is equivalent to [0-9].
\D              Matches a single character that is not a decimal digit. Normally, this is equivalent to [^0-9].

For example, the regular expression [A-Z]\d[A-Z] \d[A-Z]\d matches a Canadian postal code of the form A1A 1A1 where A represents a letter and 1 represents a digit.
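The postal code pattern can be tried with a short sketch like the following. The class and method names are hypothetical. Because the pattern has no anchors, the method reports a match whenever the input contains text with the right shape anywhere inside it.

```csharp
using System.Text.RegularExpressions;

public static class PostalCodeSketch
{
    // Returns true if text contains something shaped like A1A 1A1.
    public static bool ContainsPostalCode(string text)
    {
        // [A-Z] matches one capital letter and \d matches one digit.
        return Regex.IsMatch(text, @"[A-Z]\d[A-Z] \d[A-Z]\d");
    }
}
```

For example, ContainsPostalCode("K1A 0B1") returns true, but the lowercase string "k1a 0b1" does not match because [A-Z] includes only capital letters.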

Anchors

An anchor (also called an atomic zero-width assertion) represents a state that the input string must be in at a certain point to achieve a match. Anchors have a position in the string but do not use up characters.

For example, the ^ and $ characters represent the beginning and ending of a line or the string, depending on whether you work on multiline or single-line input.

The following table lists the most useful anchors.

Anchor    Meaning
^         Matches the beginning of the line or string
$         Matches the end of the string or before the \n at the end of the line or string
\A        Matches the beginning of the string
\Z        Matches the end of the string or before the \n at the end of the string
\z        Matches the end of the string
\G        Matches where the previous match ended
\B        Matches a nonword boundary

For more information on these options, see “Regular Expression Options” at msdn.microsoft.com/library/yd1hzczs.aspx.
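The effect of multiline input on the ^ anchor can be seen in a small sketch such as the following. The class and method names are hypothetical. The method counts the places where a digit appears at a position that ^ accepts, and the caller chooses the options.

```csharp
using System.Text.RegularExpressions;

public static class AnchorSketch
{
    // Counts the digits that appear where the ^ anchor matches.
    // With RegexOptions.None, ^ matches only at the start of the string.
    // With RegexOptions.Multiline, ^ also matches at the start of each line.
    public static int CountDigitStarts(string text, RegexOptions options)
    {
        return Regex.Matches(text, @"^\d", options).Count;
    }
}
```

With the input "1a\nb\n2c", the method returns 1 when called with RegexOptions.None and 2 when called with RegexOptions.Multiline.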

Grouping Constructs

Grouping constructs enable you to define capture groups within matching pieces of a string. For example, in a U.S. Social Security number with the format 123-45-6789, you could define groups to hold the pieces 123, 45, and 6789. The program could later refer to those groups either with C# code or later inside the same regular expression.

Parentheses create groups. For example, consider the expression (\w)\1. The parentheses create a numbered group that in this example matches a single word character. Later in the expression, the text \1 refers to group number 1. That means this regular expression matches a word character followed by itself. If the string is “book,” then this pattern would match the “oo” in the middle.

There are several kinds of groups, some of which are fairly specialized and confusing. The two most common are numbered and named groups.

To create a numbered group, simply enclose a subexpression in parentheses as shown in the previous example.

To create a named group, use the syntax (?<name>subexpression) where name is the name you want to assign to the group and subexpression is a subexpression.

To use a named group in a regular expression, use the syntax \k<name>.

For example, the expression (?<twice>\w)\k<twice> is equivalent to the previous expression (\w)\1 except the group is named twice.
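As an illustration, the following sketch (with hypothetical class and method names) uses the named group twice to find the first doubled word character in a string and then reads the captured character back through the match's Groups collection.

```csharp
using System.Text.RegularExpressions;

public static class GroupSketch
{
    // Returns the first doubled word character pair, or "" if there is none.
    public static string FirstDoubledPair(string text)
    {
        Match match = Regex.Match(text, @"(?<twice>\w)\k<twice>");
        return match.Success ? match.Value : "";
    }

    // Returns the single character captured by the named group, or "".
    public static string FirstDoubledCharacter(string text)
    {
        Match match = Regex.Match(text, @"(?<twice>\w)\k<twice>");
        return match.Success ? match.Groups["twice"].Value : "";
    }
}
```

For the input "book", FirstDoubledPair returns "oo" and FirstDoubledCharacter returns "o".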

Quantifiers

A quantifier makes the regular expression engine match the previous element a certain number of times. For example, the expression \d{3} matches any digit exactly 3 times. The following table describes regular expression quantifiers.

Quantifier    Meaning
*             Matches the previous element 0 or more times
+             Matches the previous element 1 or more times
?             Matches the previous element 0 or 1 times
{n}           Matches the previous element exactly n times
{n,}          Matches the previous element n or more times
{n,m}         Matches the previous element between n and m times (inclusive)

If you follow one of these with ?, the pattern matches the preceding expression as few times as possible. For example, the pattern BO+ matches B followed by 1 or more Os, so it would match the BOO in BOOK. The pattern BO+? also matches a B character followed by 1 or more Os, but it matches as few Os as possible, so it would match only the BO in BOOK.
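The difference between the two forms is easy to check with a short sketch. The class and method names below are hypothetical. The method simply returns the text of the first match for whatever pattern it is given.

```csharp
using System.Text.RegularExpressions;

public static class QuantifierSketch
{
    // Returns the text of the first match, or "" if the pattern does not match.
    public static string FirstMatch(string input, string pattern)
    {
        Match match = Regex.Match(input, pattern);
        return match.Success ? match.Value : "";
    }
}
```

FirstMatch("BOOK", "BO+") returns "BOO", and FirstMatch("BOOK", "BO+?") returns "BO".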

Alternation Constructs

An alternation construct uses the | character to allow a pattern to match either of two subexpressions. For example, the expression ^(true|yes)$ matches either true or yes.

For a more complicated example, the pattern ^(\d{3}-\d{4}|\d{3}-\d{3}-\d{4})$ matches 7-digit U.S. phone numbers of the form 123-4567 and 10-digit U.S. phone numbers of the form 123-456-7890.

Sample Regular Expressions

The following list describes several useful regular expressions.

  • ^\d{3}-\d{4}$

    This is a simple 7-digit phone number format and allows several illegal phone numbers such as 111-1111 and 000-0000.

    ^: Match the start of the string, so the phone number must start at the beginning of the string.

    \d: Match any digit.

    {3}: Repeat the previous (match any digit) 3 times. In other words, match 3 digits.

    -: Match the - character.

    \d: Match any digit.

    {4}: Match 4 digits.

    $: Match the end of the string, so nothing can follow the phone number.

  • ^[2-9][0-9]{2}-\d{4}$

    This matches a 7-digit U.S. phone number more rigorously. The exchange code at the beginning must match the pattern NXX where N is a digit 2-9 and X is any digit 0-9.

  • ^[2-9][0-8]\d-[2-9][0-9]{2}-\d{4}$

    This pattern matches a 10-digit U.S. phone number with the format NPA-NXX-XXXX where N is a digit 2-9, P is a digit 0-8, A is any digit 0-9, and X is any digit 0-9.

  • ^([2-9][0-8]\d-)?[2-9][0-9]{2}-\d{4}$

    This pattern matches a U.S. phone number with an optional area code such as 202-234-5678 or 234-5678. The part of the pattern ([2-9][0-8]\d-)? matches the area code. The ? at the end means the preceding group can appear 0 or 1 times, so it’s optional. The rest of the pattern is similar to the earlier pattern that matches a 7-digit U.S. phone number.

  • ^\d{5}(-\d{4})?$

    This pattern matches a U.S. ZIP code with optional +4 as in 12345 or 12345-6789.

  • ^[A-Z]\d[A-Z] \d[A-Z]\d$

    This pattern matches a Canadian postal code with the format A1A 1A1 where A is any capital letter and 1 is any digit.

  • ^[a-zA-Z0-9._-]{3,16}$

    This pattern matches a username with 3 to 16 characters that can be dashes, letters, digits, periods, or underscores. You may need to modify the allowed characters to fit your application.

  • ^[a-zA-Z][a-zA-Z0-9._-]{2,15}$

    This pattern matches a username that includes a letter followed by 2 to 15 dashes, letters, digits, periods, or underscores. You may need to modify the allowed characters to fit your application.

  • ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9._%+-]+\.[a-zA-Z]{2,4}$

    This pattern matches an e-mail address.

    The sequence [a-zA-Z0-9._%+-] matches letters, digits, periods, underscores, %, +, and -. The plus sign after that group means the string must include one or more of those characters.

    Next, the pattern matches the @ symbol.

    The pattern then matches one or more characters from the same set, followed by a period, and then between two and four letters.

    For example, this pattern matches . This pattern isn’t perfect but it matches most valid e-mail addresses.

  • ^[+-]?[a-fA-F0-9]{3}$

    This pattern matches a 3-digit hexadecimal value with an optional sign + or – as in +A1F.

  • ^(https?://)?([\w-]+\.)+[\w-]+$

    This pattern matches a top-level HTTP web address such as http://www.csharphelper.com.

    The pattern (https?://)? matches http, followed by an s zero or one times, followed by ://. The whole group is followed by ?, so the whole group must appear zero or one times.

    The pattern ([\w-]+\.)+ matches a word character (letter, digit, or underscore) or dash one or more times followed by a period. This whole group is followed by + so the whole group must appear one or more times.

    The final piece [\w-]+ matches one or more letters, digits, underscores, or dashes.

    This pattern isn’t perfect. In particular it doesn’t validate the final part of the domain, so it would match www.something.whatever.

  • ^(https?://)?([\w-]+\.)+[\w-]+(/(([\w-]+)(\.[\w-]+)*)*)*$

    This pattern matches an HTTP web URI such as http://www.csharphelper.com/howto_index.html. It starts with the same pattern as the preceding example. The new piece is the final group, (/(([\w-]+)(\.[\w-]+)*)*)*.

    The entire new piece is surrounded by parentheses and followed by *, so the whole thing can appear zero or more times.

    The new piece begins with a /, so the text must start with a / character.

    The rest of the new piece is surrounded with parentheses and followed by *, so that part can appear zero or more times. This allows the URL to end with a /.

    The rest of the pattern is ([\w-]+)(\.[\w-]+)*. The first part ([\w-]+) requires the string to include one or more letters, digits, underscores, or dashes. The second part (\.[\w-]+) requires the string to contain a period followed by one or more letters, digits, underscores, or dashes. This second part is followed by *, so it can appear zero or more times. (Basically this piece means the URL can include characters separated by periods but it cannot end with a period.)

    Again, this pattern isn’t perfect and doesn’t handle some more advanced URLs such as those that include =, ?, and # characters, but it does handle many typical URLs.

Notice that all these patterns begin with the beginning-of-line anchor ^ and end with the end-of-line anchor $, so each pattern matches the entire string, not just part of it. For example, the pattern ^\d{5}(-\d{4})?$ matches complete strings that look like ZIP codes such as 12345. Without the ^ and $, the pattern would match strings that contain a string that looks like a ZIP code such as x12345x.
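The effect of the anchors can be checked with a sketch like the following, which tests the same ZIP code pattern with and without ^ and $. The class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class ZipCodeSketch
{
    // Returns true only if the whole string looks like a ZIP code.
    public static bool IsZipCode(string text)
    {
        return Regex.IsMatch(text, @"^\d{5}(-\d{4})?$");
    }

    // Returns true if any part of the string looks like a ZIP code.
    public static bool ContainsZipCode(string text)
    {
        return Regex.IsMatch(text, @"\d{5}(-\d{4})?");
    }
}
```

IsZipCode accepts 12345 and 12345-6789 but rejects x12345x, while ContainsZipCode accepts all three.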

Using Regular Expressions

The Regex class provides objects that you can use to work with regular expressions. The following table summarizes the Regex class’s most useful methods.

Method     Purpose
IsMatch    Returns true if a string satisfies a regular expression.
Match      Searches a string for the first part of it that satisfies a regular expression.
Matches    Returns a collection giving information about all parts of a string that satisfy a regular expression.
Replace    Replaces some or all the parts of the string that match a regular expression with a new value. (This is much more powerful than the string class’s Replace method.)
Split      Splits a string into an array of substrings delimited by pieces of the string that match a regular expression.

Many of the methods described in the table have multiple overloaded versions. The instance versions take the input string as a parameter and use the regular expression that you passed into the object’s constructor.

The Regex class also provides static versions of these methods that take both a string and a regular expression as parameters. For example, the following code determines whether the text in inputTextBox satisfies the regular expression in patternTextBox.

if (Regex.IsMatch(inputTextBox.Text, patternTextBox.Text))
    resultLabel.Text = "Match";
else
    resultLabel.Text = "No match";

The static methods make simple regular expression testing easy.
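To compare the two styles, the following sketch calls Split once through a Regex object and once through the static method. The class and method names are hypothetical, and the comma-delimited input format is an assumption chosen only for illustration.

```csharp
using System.Text.RegularExpressions;

public static class SplitSketch
{
    // Matches a comma together with any whitespace around it.
    private const string DelimiterPattern = @"\s*,\s*";

    // Instance style: the pattern is given to the constructor.
    public static string[] SplitWithInstance(string input)
    {
        Regex regex = new Regex(DelimiterPattern);
        return regex.Split(input);
    }

    // Static style: the pattern is passed with each call.
    public static string[] SplitWithStatic(string input)
    {
        return Regex.Split(input, DelimiterPattern);
    }
}
```

Both methods split the input "a, b ,c" into the three strings a, b, and c. Creating a Regex object can be convenient when a program uses the same pattern many times.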

The following sections explain how to use the Regex class to perform common regular expression tasks such as finding matches, making replacements, and parsing input strings.

Matching Patterns

The Regex class’s static IsMatch method gives you an easy way to determine whether a string satisfies a regular expression. The MatchPattern example program, which is available for download and shown in Figure 21-1, uses this method to determine whether a string matches a pattern.

Figure 21-1: The MatchPattern example program determines whether a string satisfies a regular expression.

When you modify the regular expression or the string, the program executes the following code.

// See if the text matches the pattern.
private void CheckForMatch()
{
    try
    {
        if (Regex.IsMatch(inputTextBox.Text, patternTextBox.Text))
            resultLabel.Text = "Match";
        else
            resultLabel.Text = "No match";
    }
    catch (Exception ex)
    {
        resultLabel.Text = ex.Message;
    }
}

The code passes the Regex.IsMatch method the string to validate and the regular expression. The method returns true if the string satisfies the expression. The program then displays an appropriate message in resultLabel.

This example uses a try-catch block to protect itself from improperly formed regular expressions. For example, suppose you want to use the expression (.)\1 to detect repeated characters. At one point while you’re typing you will have entered just (, which is not a valid regular expression.

Finding Matches

The MatchPattern example program described in the preceding section determines whether a string satisfies a regular expression. For example, it can use the pattern (.)\1 to verify that the string bookkeeper contains a double letter. However, that program won’t tell you where the double letter is. In this example, bookkeeper contains three double letters: oo, kk, and ee.

The Regex class’s Matches method can give you information about places where a string matches a regular expression. The FindMatches example program, which is available for download and shown in Figure 21-2, displays the parts of a string that match a pattern.

Figure 21-2: The FindMatches example program finds the parts of a string that match a pattern.

The FindMatches program uses the following code to locate matches.

// Display matches.
private void FindMatches()
{
    try
    {
        // Make the regex object.
        Regex regex = new Regex(patternTextBox.Text);

        // Find the matches.
        string matches = "";
        foreach (Match match in regex.Matches(inputTextBox.Text))
        {
            // Display the matches.
            matches += match.Value + " ";
        }
        resultLabel.Text = matches;
    }
    catch (Exception ex)
    {
        resultLabel.Text = ex.Message;
    }
}

This code creates a Regex object, passing its constructor the regular expression pattern. It then calls the object’s Matches method, passing it the input string. It loops through the resulting collection of Match objects and adds each match’s Value to a result string. When it is finished, it displays the results in resultLabel.

The following table lists the Match class’s most useful properties.

Property    Purpose
Groups      Returns a collection of objects representing any groups captured by the regular expression. The Group class has Index, Length, and Value properties that describe the group.
Index       The index of the match’s first character.
Length      The length of the text represented by this match.
Value       The text represented by this match.
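The following sketch shows one way these properties might be used to report where each match occurs. The class and method names are hypothetical, and the output format of index, length, and value separated by colons is chosen only for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchInfoSketch
{
    // Returns one "index:length:value" entry for each match of pattern in text.
    public static List<string> DescribeMatches(string text, string pattern)
    {
        List<string> results = new List<string>();
        foreach (Match match in Regex.Matches(text, pattern))
        {
            results.Add(match.Index + ":" + match.Length + ":" + match.Value);
        }
        return results;
    }
}
```

For the input bookkeeper and the pattern (.)\1, the method returns the three entries 1:2:oo, 3:2:kk, and 5:2:ee.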

Making Replacements

The Regex class’s static Replace method lets you replace the parts of a string that match a pattern with a new string. The MakeReplacements example program, which is available for download and shown in Figure 21-3, replaces parts of a string that match a pattern with a new string.

Figure 21-3: The MakeReplacements example program replaces matching parts of a string with a new value.

The following code shows how the MakeReplacements program makes replacements.

// Make the replacements.
private void replaceButton_Click(object sender, EventArgs e)
{
    resultTextBox.Text = Regex.Replace(
        inputTextBox.Text,
        patternTextBox.Text,
        replaceWithTextBox.Text);
}

This code simply calls the Replace method, passing it the input string, the pattern to match, and the replacement text.
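As a small illustration outside the example program, the following sketch replaces every digit in a string with a # character. The class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class ReplaceSketch
{
    // Replaces each decimal digit in text with the # character.
    public static string MaskDigits(string text)
    {
        // \d matches one digit, so every digit is replaced separately.
        return Regex.Replace(text, @"\d", "#");
    }
}
```

For example, MaskDigits("Call 555-1337") returns "Call ###-####".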

Parsing Input

In some situations, you can use a Regex object to parse an input string by using capture groups. After matching a string, you can loop through the collection returned by the Regex object’s Matches method to find parts of the string that matched the expression. You can then use each Match’s Groups collection, indexed by number or name (if the groups are named), to get the pieces of the match in the groups.

The ParsePhoneNumber example program, which is available for download and shown in Figure 21-4, uses a Regex object to find the pieces of a 10-digit phone number.

Figure 21-4: The ParsePhoneNumber example program parses out the pieces of a phone number.

The ParsePhoneNumber program uses the following code to find the phone number’s pieces.

// Find matching groups.
private void parseButton_Click(object sender, EventArgs e)
{
    groupsListBox.Items.Clear();

    Regex regex = new Regex(patternTextBox.Text);
    foreach (Match match in regex.Matches(inputTextBox.Text))
    {
        groupsListBox.Items.Add("NPA:  " + match.Groups["NPA"]);
        groupsListBox.Items.Add("NXX:  " + match.Groups["NXX"]);
        groupsListBox.Items.Add("XXXX: " + match.Groups["XXXX"]);
    }
}

This code clears its ListBox and then creates a Regex object, passing its constructor the pattern to match. It then calls the Matches method, passing it the input string, and loops through the resulting collection of matches.

If the input string contains a single phone number, there will be only one match in the Matches collection. In fact, the pattern shown in Figure 21-4 can produce at most one match.

For each match, the code uses the Groups collection to get the text in the named groups NPA, NXX, and XXXX. It adds the values it finds to the result ListBox.
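The pattern that the program uses appears only in the figure, so the following sketch assumes a pattern built from the 10-digit phone number expression described earlier in this chapter, with the named groups NPA, NXX, and XXXX added. The class name, method name, and pattern are assumptions made for illustration and may differ from the downloadable example.

```csharp
using System.Text.RegularExpressions;

public static class PhoneParseSketch
{
    // Assumed pattern: a 10-digit number of the form NPA-NXX-XXXX
    // with one named group for each piece.
    private const string PhonePattern =
        @"^(?<NPA>[2-9][0-8]\d)-(?<NXX>[2-9][0-9]{2})-(?<XXXX>\d{4})$";

    // Returns the three pieces of the phone number,
    // or null if the text does not match the pattern.
    public static string[] Parse(string text)
    {
        Match match = Regex.Match(text, PhonePattern);
        if (!match.Success) return null;

        return new string[]
        {
            match.Groups["NPA"].Value,
            match.Groups["NXX"].Value,
            match.Groups["XXXX"].Value
        };
    }
}
```

With this assumed pattern, Parse("800-555-1337") returns the pieces 800, 555, and 1337, and Parse("555-1337") returns null because the area code is missing.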

Summary

The string class provides methods that you can use to examine a string to see if it matches a particular pattern. Unfortunately, using those methods can be a lot of work if the pattern is complicated.

The Regex class provides another approach that is often much easier. Instead of breaking a string apart to see if it matches a pattern, the Regex class lets you define a regular expression and then see whether the string satisfies the expression. The class’s methods also let you find parts of a string that match the expression, make replacements, and parse a string to find pieces that match parts of a regular expression.

Chapter 8, “LINQ,” described PLINQ, which lets you perform multiple LINQ queries in parallel. PLINQ isn’t the only way a program can perform multiple tasks at the same time. The following chapter explains other methods in which you can use parallelism to improve performance on multiple CPU or multicore systems.

Exercises

  1. Write a program that takes an input string, removes leading and trailing whitespace, and replaces sequences of multiple whitespace characters with single spaces. For example, with the input string “ This is a test. ” the program should produce the result “This is a test.”
  2. Write a program that uses Regex.Split to list the words in a string.
  3. Modify the FindMatches example program shown in Figure 21-2 so that it displays matches in a ListBox. Then run the program with the regular expression \w* and the input string abc;def;;;ghi;;. Why doesn’t the program display only the three strings abc, def, and ghi? How can you make it display only those strings?
  4. The ParsePhoneNumber example program shown in Figure 21-4 cannot match a phone number that is missing the area code. It can parse 800-555-1337 but cannot parse 555-1337. What regular expression can you use to parse phone numbers with or without an area code? What does the program do if a phone number is missing the area code?
  5. Modify the ParsePhoneNumber example program so that it parses multiple phone numbers in a multiline string. Display each phone number’s pieces on a separate line in the ListBox as in NPA: 800, NXX: 555, XXXX: 1337.

    Hints: When you create the Regex object, use the Multiline option. That makes the ^ and $ characters match the start and end of each line instead of the entire string. Also note that the Regex object considers [Newline] to mark the end of a line but a TextBox uses the [Return][Newline] combination to mark the end of a line. Modify the regular expression so it can remove the [Return] characters as needed.

  6. Write a regular expression that matches integers with digit grouping such as –1,234 and +91. Use the MatchPattern example program shown in Figure 21-1 to check your result.
  7. Write a regular expression that matches floating-point numbers such as –1234.56 and +65.43. If a number has a decimal point, require at least one digit after it.
  8. Write a regular expression that matches floating-point numbers with digit grouping such as –1,234.5678.
  9. Replacement patterns let the Regex.Replace method use groups and other parts of a match in replacement text. For example, within the replacement text, you can use $1 to represent the text in the first match group.

    Write a program that lets you type a series of names of the form Ann Archer. When you click the Rearrange button, make the program use replacement patterns to rewrite the names in the form Archer, Ann.

    Hints: Remember to remove the trailing [Return] character from each line. Use replacement patterns to restore it in the result so each name appears on a separate line.
