What’s in This Chapter
Wrox.com Downloads for This Chapter
Please note that all the code examples for this chapter are available as a part of this chapter’s code download on the book’s website at www.wrox.com/go/csharp5programmersref on the Download Code tab.
Many applications enable the user to type information but the information should match some sort of pattern. For example, the string 784-369 is not a valid phone number and Rod@Stephens@C#Helper.com is not a valid e-mail address.
One approach for validating this kind of input is to use string methods. You could use the string class’s IndexOf, LastIndexOf, Substring, and other methods to break the input apart and see if the pieces make sense. For all but the simplest situations, however, that would be a huge amount of work.
Regular expressions provide another method for verifying that the user’s input matches a pattern. A regular expression is a string that contains characters that define a pattern. For example, the regular expression ^\d{3}-\d{4}$ represents a pattern that matches three digits followed by a hyphen followed by four more digits as in 123-4567. (This isn’t a great pattern for matching U.S. phone numbers because it allows many invalid combinations such as 111-1111 and 000-0000.)
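As a minimal sketch of the idea, the following console program tests two strings against the simple phone number pattern. The class and method names are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class PhonePatternDemo
{
    // Hypothetical helper: true if the text looks like a 7-digit phone number.
    public static bool LooksLikePhoneNumber(string text)
    {
        // The @ prefix makes a verbatim string, so \d reaches the regex engine unchanged.
        return Regex.IsMatch(text, @"^\d{3}-\d{4}$");
    }

    static void Main()
    {
        Console.WriteLine(LooksLikePhoneNumber("123-4567")); // True
        Console.WriteLine(LooksLikePhoneNumber("784-369"));  // False
    }
}
```

Writing the pattern as a verbatim string avoids having to double every backslash.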
The .NET Framework includes classes that can use regular expressions to see if an input string matches the pattern. They also provide methods for locating patterns within input text and for making complex substitutions.
This chapter provides an introduction to regular expressions. It explains how to create regular expressions and how to use them to see if a complete string matches a pattern, find matches within a string, use patterns to make replacements, and parse inputs.
The following section explains the regular expression syntax used by .NET. The sections after that one explain how to determine whether a string matches a pattern, find matches within a string, and make replacements.
Before you can write code to see if an input string matches a pattern, you need to know how to build a regular expression to represent that pattern.
A regular expression can contain literal characters that the input must match exactly and characters that have special meanings. For example, the sequence [0-9] means the input must match a single digit 0 through 9.
Regular expressions can also contain special character sequences called escape sequences that match specific patterns or that control the behavior of a regular expression class. For example, the escape sequence \d makes the pattern match a single digit just as [0-9] does.
If you want to include a special character such as \ or [ in a regular expression without it taking on its special meaning, you can “escape it” as in \\ or \[.
The tools you use to build regular expressions can be divided into six categories: character escapes, character classes, anchors, grouping constructs, quantifiers, and alternation constructs. The following sections describe those categories. The section after that describes some example regular expressions that match common input patterns such as telephone numbers.
A character escape matches special characters such as [Tab] that you cannot simply type into a string. The following table lists the most useful character escapes.
Escape | Meaning |
\t | Matches the tab character |
\r | Matches the return character |
\n | Matches the newline character |
\nnn | Matches a character with ASCII code given by the two or three octal digits nnn |
\xnn | Matches a character with ASCII code given by the two hexadecimal digits nn |
\unnnn | Matches a character with Unicode representation given by the four hexadecimal digits nnnn |
For example, the regular expression \u00A7 matches the section symbol § (because 00A7 is that character’s hexadecimal Unicode value).
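The following sketch shows two of these escapes in use. The helper names are hypothetical. Note that inside the verbatim pattern strings the escapes are interpreted by the regular expression engine, whereas in the ordinary C# test strings they are interpreted by the compiler.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    // Hypothetical helper: true if the text contains the section symbol.
    public static bool ContainsSectionSign(string text)
    {
        return Regex.IsMatch(text, @"\u00A7");
    }

    // Hypothetical helper: true if the text contains a tab character.
    public static bool ContainsTab(string text)
    {
        return Regex.IsMatch(text, @"\t");
    }

    static void Main()
    {
        Console.WriteLine(ContainsSectionSign("See \u00A7 4.2")); // True
        Console.WriteLine(ContainsTab("a\tb"));                   // True
        Console.WriteLine(ContainsTab("a b"));                    // False
    }
}
```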
A character class matches one of the items in a set of characters. For example, \d matches a digit 0 through 9. The following table lists the most useful character class constructs.
Construct | Meaning |
[chars] | Matches one of the characters inside the brackets. For example, [aeiou] matches a single lowercase vowel. |
[^chars] | Matches a character that is not inside the brackets. For example, [^aeiouAEIOU] matches a single nonvowel character such as Q, ?, or 3. |
[first-last] | Matches a character between the character first and the character last. For example, [a-z] matches any lowercase letter between a and z. You can combine multiple ranges as in [a-zA-Z], which matches uppercase or lowercase letters. |
. | This is a wildcard that matches any single character except \n. (To match a period, use the \. escape sequence.) |
\w | Matches a single “word” character. Normally, this is equivalent to [a-zA-Z_0-9], so it matches letters, the underscore character, and digits. |
\W | Matches a single nonword character. Normally, this is equivalent to [^a-zA-Z_0-9]. |
\s | Matches a single whitespace character. Normally, this includes [Space], [Form feed], [Newline], [Return], [Tab], and [Vertical tab]. |
\S | Matches a single nonwhitespace character. Normally, this matches everything except [Space], [Form feed], [Newline], [Return], [Tab], and [Vertical tab]. |
\d | Matches a single decimal digit. Normally, this is equivalent to [0-9]. |
\D | Matches a single character that is not a decimal digit. Normally, this is equivalent to [^0-9]. |
For example, the regular expression [A-Z]\d[A-Z] \d[A-Z]\d matches a Canadian postal code of the form A1A 1A1 where A represents a letter and 1 represents a digit.
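A short sketch of character classes in code follows. It assumes you want to validate a whole string, so the postal code pattern is wrapped in ^ and $ anchors. The helper names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class CharacterClassDemo
{
    // Hypothetical helper: true if the whole string has the form A1A 1A1.
    public static bool IsPostalCode(string text)
    {
        return Regex.IsMatch(text, @"^[A-Z]\d[A-Z] \d[A-Z]\d$");
    }

    // Hypothetical helper: counts lowercase vowels with a bracketed class.
    public static int CountLowercaseVowels(string text)
    {
        return Regex.Matches(text, "[aeiou]").Count;
    }

    static void Main()
    {
        Console.WriteLine(IsPostalCode("K1A 0B1"));                     // True
        Console.WriteLine(IsPostalCode("k1a 0b1"));                     // False: [A-Z] is case-sensitive
        Console.WriteLine(CountLowercaseVowels("regular expression")); // 7
    }
}
```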
An anchor (also called an atomic zero-width assertion) represents a state that the input string must be in at a certain point to achieve a match. Anchors have a position in the string but do not use up characters.
For example, the ^ and $ characters represent the beginning and ending of a line or the string, depending on whether you work on multiline or single-line input.
The following table lists the most useful anchors.
Anchor | Meaning |
^ | Matches the beginning of the line or string |
$ | Matches the end of the string or before the \n at the end of the line or string |
\A | Matches the beginning of the string |
\Z | Matches the end of the string or before the \n at the end of the string |
\z | Matches the end of the string |
\G | Matches where the previous match ended |
\B | Matches a nonword boundary |
For more information on these options, see “Regular Expression Options” at msdn.microsoft.com/library/yd1hzczs.aspx.
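The difference between ^ and \A is easiest to see with multiline input. The following sketch, with hypothetical helper names, counts the words that each anchor allows when the RegexOptions.Multiline option is and is not set.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorDemo
{
    // Hypothetical helper: counts words that start at a position matched by ^.
    public static int CountWordsAtCaret(string text, RegexOptions options)
    {
        return Regex.Matches(text, @"^\w+", options).Count;
    }

    // Hypothetical helper: counts words at the very start of the string (\A).
    public static int CountWordsAtStringStart(string text, RegexOptions options)
    {
        return Regex.Matches(text, @"\A\w+", options).Count;
    }

    static void Main()
    {
        string text = "alpha beta\ngamma delta";

        // Without Multiline, ^ matches only at the start of the string.
        Console.WriteLine(CountWordsAtCaret(text, RegexOptions.None));            // 1

        // With Multiline, ^ also matches after each newline.
        Console.WriteLine(CountWordsAtCaret(text, RegexOptions.Multiline));       // 2

        // \A ignores the Multiline option.
        Console.WriteLine(CountWordsAtStringStart(text, RegexOptions.Multiline)); // 1
    }
}
```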
Grouping constructs enable you to define capture groups within matching pieces of a string. For example, in a U.S. Social Security number with the format 123-45-6789, you could define groups to hold the pieces 123, 45, and 6789. The program can then refer to those groups either in C# code or later inside the same regular expression.
Parentheses create groups. For example, consider the expression (\w)\1. The parentheses create a numbered group that in this example matches a single word character. Later in the expression, the text \1 refers to group number 1. That means this regular expression matches a word character followed by itself. If the string is “book,” then this pattern would match the “oo” in the middle.
There are several kinds of groups, some of which are fairly specialized and confusing. The two most common are numbered and named groups.
To create a numbered group, simply enclose a subexpression in parentheses as shown in the previous example.
To create a named group, use the syntax (?<name>subexpression) where name is the name you want to assign to the group and subexpression is a subexpression.
To use a named group in a regular expression, use the syntax \k<name>.
For example, the expression (?<twice>\w)\k<twice> is equivalent to the previous expression (\w)\1 except the group is named twice.
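The following sketch applies both forms to the word “book.” The helper names are hypothetical. The first returns the whole match, and the second returns only the text captured by the named group.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupDemo
{
    // Hypothetical helper: returns the first doubled character pair, or null.
    public static string FirstDoubledPair(string text)
    {
        Match match = Regex.Match(text, @"(\w)\1");
        return match.Success ? match.Value : null;
    }

    // Hypothetical helper: returns the character captured by the named group, or null.
    public static string FirstDoubledCharacter(string text)
    {
        Match match = Regex.Match(text, @"(?<twice>\w)\k<twice>");
        return match.Success ? match.Groups["twice"].Value : null;
    }

    static void Main()
    {
        Console.WriteLine(FirstDoubledPair("book"));         // oo
        Console.WriteLine(FirstDoubledCharacter("book"));    // o
        Console.WriteLine(FirstDoubledPair("cat") == null);  // True
    }
}
```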
A quantifier makes the regular expression engine match the previous element a certain number of times. For example, the expression \d{3} matches any digit exactly 3 times. The following table describes regular expression quantifiers.
Quantifier | Meaning |
* | Matches the previous element 0 or more times |
+ | Matches the previous element 1 or more times |
? | Matches the previous element 0 or 1 times |
{n} | Matches the previous element exactly n times |
{n,} | Matches the previous element n or more times |
{n,m} | Matches the previous element between n and m times (inclusive) |
If you follow one of these quantifiers with ?, the pattern matches the preceding expression as few times as possible. For example, the pattern BO+ matches B followed by 1 or more Os, so it would match the BOO in BOOK. The pattern BO+? also matches a B character followed by 1 or more Os, but it matches as few Os as possible, so it would match only the BO in BOOK.
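The difference between the greedy and lazy forms can be checked with a few lines of code. This is a minimal sketch. The helper name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // Hypothetical helper: returns the text of the first match, or an empty string.
    public static string FirstMatch(string text, string pattern)
    {
        return Regex.Match(text, pattern).Value;
    }

    static void Main()
    {
        Console.WriteLine(FirstMatch("BOOK", "BO+"));  // BOO (greedy: as many Os as possible)
        Console.WriteLine(FirstMatch("BOOK", "BO+?")); // BO  (lazy: as few Os as possible)
    }
}
```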
An alternation construct uses the | character to allow a pattern to match either of two subexpressions. For example, the expression ^(true|yes)$ matches either true or yes.
For a more complicated example, the pattern ^(\d{3}-\d{4}|\d{3}-\d{3}-\d{4})$ matches 7-digit U.S. phone numbers of the form 123-4567 and 10-digit U.S. phone numbers of the form 123-456-7890.
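Both alternation patterns can be exercised with a short sketch such as the following. The helper names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class AlternationDemo
{
    // Hypothetical helper: true if the whole string is "true" or "yes".
    public static bool IsAffirmative(string text)
    {
        return Regex.IsMatch(text, @"^(true|yes)$");
    }

    // Hypothetical helper: true for 123-4567 or 123-456-7890 style numbers.
    public static bool IsSevenOrTenDigitPhone(string text)
    {
        return Regex.IsMatch(text, @"^(\d{3}-\d{4}|\d{3}-\d{3}-\d{4})$");
    }

    static void Main()
    {
        Console.WriteLine(IsAffirmative("yes"));                    // True
        Console.WriteLine(IsAffirmative("yes please"));             // False
        Console.WriteLine(IsSevenOrTenDigitPhone("123-4567"));      // True
        Console.WriteLine(IsSevenOrTenDigitPhone("123-456-7890"));  // True
        Console.WriteLine(IsSevenOrTenDigitPhone("123-45-6789"));   // False
    }
}
```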
The following list describes several useful regular expressions.
^\d{3}-\d{4}$
This is a simple 7-digit phone number format and allows several illegal phone numbers such as 111-1111 and 000-0000.
^ : Match the start of the string, so the phone number must start at the beginning of the string.
\d : Match any digit.
{3} : Repeat the previous (match any digit) 3 times. In other words, match 3 digits.
- : Match the - character.
\d : Match any digit.
{4} : Match 4 digits.
^[2-9][0-9]{2}-\d{4}$
This matches a 7-digit U.S. phone number more rigorously. The exchange code at the beginning must match the pattern NXX where N is a digit 2-9 and X is any digit 0-9.
^[2-9][0-8]\d-[2-9][0-9]{2}-\d{4}$
This pattern matches a 10-digit U.S. phone number with the format NPA-NXX-XXXX where N is a digit 2-9, P is a digit 0-8, A is any digit 0-9, and X is any digit 0-9.
^([2-9][0-8]\d-)?[2-9][0-9]{2}-\d{4}$
This pattern matches a U.S. phone number with an optional area code such as 202-234-5678 or 234-5678. The part of the pattern ([2-9][0-8]\d-)? matches the area code. The ? at the end means the preceding group can appear 0 or 1 times, so it’s optional. The rest of the pattern is similar to the earlier pattern that matches a 7-digit U.S. phone number.
^\d{5}(-\d{4})?$
This pattern matches a U.S. ZIP code with optional +4 as in 12345 or 12345-6789.
^[A-Z]\d[A-Z] \d[A-Z]\d$
This pattern matches a Canadian postal code with the format A1A 1A1 where A is any capital letter and 1 is any digit.
^[a-zA-Z0-9._-]{3,16}$
This pattern matches a username with 3 to 16 characters that can be dashes, letters, digits, periods, or underscores. You may need to modify the allowed characters to fit your application.
^[a-zA-Z][a-zA-Z0-9._-]{2,15}$
This pattern matches a username that includes a letter followed by 2 to 15 dashes, letters, digits, periods, or underscores. You may need to modify the allowed characters to fit your application.
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9._%+-]+\.[a-zA-Z]{2,4}$
This pattern matches an e-mail address.
The sequence [a-zA-Z0-9._%+-] matches letters, digits, periods, underscores, %, +, and -. The plus sign after that group means the string must include one or more of those characters.
Next, the pattern matches the @ symbol.
The pattern then matches the same set of characters one or more times, followed by a ., and then between two and four letters.
For example, this pattern matches [email protected]. This pattern isn’t perfect but it matches most valid e-mail addresses.
^[+-]?[a-fA-F0-9]{3}$
This pattern matches a 3-digit hexadecimal value with an optional sign + or - as in +A1F.
^(https?://)?([\w-]+\.)+[\w-]+$
This pattern matches a top-level HTTP web address such as http://www.csharphelper.com.
The pattern (https?://)? matches http, followed by an s zero or one times, followed by ://. The whole group is followed by ?, so the whole group must appear zero or one times.
The pattern ([\w-]+\.)+ matches a word character (letter, digit, or underscore) or dash one or more times followed by a period. This whole group is followed by +, so the whole group must appear one or more times.
The final piece [\w-]+ matches one or more letters, digits, underscores, or dashes.
This pattern isn’t perfect. In particular it doesn’t validate the final part of the domain, so it would match www.something.whatever.
^(https?://)?([\w-]+\.)+[\w-]+(/(([\w-]+)(\.[\w-]+)*)*)*$
This pattern matches an HTTP web URI such as http://www.csharphelper.com/howto_index.html. It starts with the same code used by the preceding pattern. The new part is the final group, (/(([\w-]+)(\.[\w-]+)*)*)*.
The entire new piece is surrounded by parentheses and followed by *, so the whole thing can appear zero or more times.
The new piece begins with a /, so the text must start with a / character.
The rest of the new piece is surrounded with parentheses and followed by *, so that part can appear zero or more times. This allows the URL to end with a /.
The rest of the pattern is ([\w-]+)(\.[\w-]+)*. The first part ([\w-]+) requires the string to include one or more letters, digits, underscores, or dashes. The second part (\.[\w-]+)* matches a period followed by one or more letters, digits, underscores, or dashes. Because this second part is followed by *, it can appear zero or more times. (Basically this piece means the URL can include characters separated by periods but it cannot end with a period.)
Again, this pattern isn’t perfect and doesn’t handle some more advanced URLs such as those that include =, ?, and # characters, but it does handle many typical URLs.
Notice that all these patterns begin with the beginning-of-line anchor ^ and end with the end-of-line anchor $, so each pattern matches the entire string, not just part of it. For example, the pattern ^\d{5}(-\d{4})?$ matches complete strings that look like ZIP codes such as 12345. Without the ^ and $, the pattern would match strings that contain a string that looks like a ZIP code such as x12345x.
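The effect of the anchors is easy to confirm in code. The following sketch compares the anchored and unanchored ZIP code patterns. The helper names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class ZipAnchorDemo
{
    // Hypothetical helper: true only if the whole string looks like a ZIP code.
    public static bool IsZipCode(string text)
    {
        return Regex.IsMatch(text, @"^\d{5}(-\d{4})?$");
    }

    // Hypothetical helper: true if any part of the string looks like a ZIP code.
    public static bool ContainsZipCode(string text)
    {
        return Regex.IsMatch(text, @"\d{5}(-\d{4})?");
    }

    static void Main()
    {
        Console.WriteLine(IsZipCode("12345"));         // True
        Console.WriteLine(IsZipCode("12345-6789"));    // True
        Console.WriteLine(IsZipCode("x12345x"));       // False
        Console.WriteLine(ContainsZipCode("x12345x")); // True
    }
}
```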
The Regex class provides objects that you can use to work with regular expressions. The following table summarizes the Regex class’s most useful methods.
Method | Purpose |
IsMatch | Returns true if a string satisfies a regular expression. |
Match | Searches a string for the first part of it that satisfies a regular expression. |
Matches | Returns a collection giving information about all parts of a string that satisfy a regular expression. |
Replace | Replaces some or all the parts of the string that match a regular expression with a new value. (This is much more powerful than the string class’s Replace method.) |
Split | Splits a string into an array of substrings delimited by pieces of the string that match a regular expression. |
Many of the methods described in the table have multiple overloaded versions. In particular, many take a string as a parameter and can optionally take another parameter that gives a regular expression. If you don’t pass the method a regular expression, then the method uses the expression you passed into the object’s constructor.
The Regex class also provides static versions of these methods that take both a string and a regular expression as parameters. For example, the following code determines whether the text in inputTextBox satisfies the regular expression in patternTextBox.
if (Regex.IsMatch(inputTextBox.Text, patternTextBox.Text))
    resultLabel.Text = "Match";
else
    resultLabel.Text = "No match";
The static methods make simple regular expression testing easy.
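For comparison, here is a sketch that performs the same test with a Regex object and with the static method, plus a small Split example. It is a console program rather than the book's form-based code, and the class, field, and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class StaticVersusInstanceDemo
{
    // A Regex object stores its pattern, so instance methods need only the input.
    private static readonly Regex ZipRegex = new Regex(@"^\d{5}(-\d{4})?$");

    public static bool IsZipInstance(string text)
    {
        return ZipRegex.IsMatch(text);
    }

    // The static method takes the input and the pattern on every call.
    public static bool IsZipStatic(string text)
    {
        return Regex.IsMatch(text, @"^\d{5}(-\d{4})?$");
    }

    // Splits on a comma followed by optional whitespace.
    public static string[] SplitOnCommas(string text)
    {
        return Regex.Split(text, @",\s*");
    }

    static void Main()
    {
        Console.WriteLine(IsZipInstance("12345-6789")); // True
        Console.WriteLine(IsZipStatic("1234"));         // False
        Console.WriteLine(string.Join("|", SplitOnCommas("red, green,blue"))); // red|green|blue
    }
}
```

Reusing one Regex object is a reasonable choice when the same pattern is applied many times. The static methods are convenient for one-off tests.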
The following sections explain how to use the Regex class to perform common regular expression tasks such as finding matches, making replacements, and parsing input strings.
The Regex class’s static IsMatch method gives you an easy way to determine whether a string satisfies a regular expression. The MatchPattern example program, which is available for download and shown in Figure 21-1, uses this method to determine whether a string matches a pattern.
Figure 21-1: The MatchPattern example program determines whether a string satisfies a regular expression.
When you modify the regular expression or the string, the program executes the following code.
// See if the text matches the pattern.
private void CheckForMatch()
{
    try
    {
        if (Regex.IsMatch(inputTextBox.Text, patternTextBox.Text))
            resultLabel.Text = "Match";
        else
            resultLabel.Text = "No match";
    }
    catch (Exception ex)
    {
        resultLabel.Text = ex.Message;
    }
}
The code passes the Regex.IsMatch method the string to validate and the regular expression. The method returns true if the string satisfies the expression. The program then displays an appropriate message in resultLabel.
This example uses a try-catch block to protect itself from improperly formed regular expressions. For example, suppose you want to use the expression (.)\1 to detect repeated characters. At one point while you’re typing you will have entered just (, which is not a valid regular expression.
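The following console sketch shows one way to guard against an invalid pattern without a form. It assumes that catching ArgumentException is enough, which is the exception type the Regex class uses to report a malformed pattern. The helper name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class SafeMatchDemo
{
    // Hypothetical helper: returns false if the pattern is not a valid
    // regular expression; otherwise stores the match result in isMatch.
    public static bool TryIsMatch(string input, string pattern, out bool isMatch)
    {
        try
        {
            isMatch = Regex.IsMatch(input, pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // Thrown for a malformed pattern (and also for a null argument).
            isMatch = false;
            return false;
        }
    }

    static void Main()
    {
        bool isMatch;
        Console.WriteLine(TryIsMatch("book", "(", out isMatch));       // False: "(" is incomplete
        Console.WriteLine(TryIsMatch("book", @"(.)\1", out isMatch));  // True: the pattern is valid
        Console.WriteLine(isMatch);                                    // True: "book" contains "oo"
    }
}
```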
The MatchPattern example program described in the preceding section determines whether a string satisfies a regular expression. For example, it can use the pattern (.)\1 to verify that the string bookkeeper contains a double letter. However, that program won’t tell you where the double letter is. In this example, bookkeeper contains three double letters: oo, kk, and ee.
The Regex class’s Matches method can give you information about places where a string matches a regular expression. The FindMatches example program, which is available for download and shown in Figure 21-2, displays the parts of a string that match a pattern.
Figure 21-2: The FindMatches example program finds the parts of a string that match a pattern.
The FindMatches program uses the following code to locate matches.
// Display matches.
private void FindMatches()
{
    try
    {
        // Make the regex object.
        Regex regex = new Regex(patternTextBox.Text);

        // Find the matches.
        string matches = "";
        foreach (Match match in regex.Matches(inputTextBox.Text))
        {
            // Display the matches.
            matches += match.Value + " ";
        }
        resultLabel.Text = matches;
    }
    catch (Exception ex)
    {
        resultLabel.Text = ex.Message;
    }
}
This code creates a Regex object, passing its constructor the regular expression pattern. It then calls the object’s Matches method, passing it the input string. It loops through the resulting collection of Match objects and adds each match’s Value to a result string. When it is finished, it displays the results in resultLabel.
The following table lists the Match class’s most useful properties.
Property | Purpose |
Groups | Returns a collection of objects representing any groups captured by the regular expression. The Group class has Index , Length , and Value properties that describe the group. |
Index | The index of the match’s first character. |
Length | The length of the text represented by this match. |
Value | The text represented by this match. |
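As a sketch of how these properties fit together, the following console program reports the value, index, and length of each double letter in bookkeeper. The helper name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class MatchPropertiesDemo
{
    // Hypothetical helper: describes every match of the pattern in the input.
    public static List<string> DescribeMatches(string input, string pattern)
    {
        List<string> lines = new List<string>();
        foreach (Match match in Regex.Matches(input, pattern))
        {
            lines.Add(string.Format("{0} at index {1}, length {2}",
                match.Value, match.Index, match.Length));
        }
        return lines;
    }

    static void Main()
    {
        foreach (string line in DescribeMatches("bookkeeper", @"(.)\1"))
        {
            Console.WriteLine(line);
        }
        // Prints:
        // oo at index 1, length 2
        // kk at index 3, length 2
        // ee at index 5, length 2
    }
}
```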
The Regex class’s static Replace method lets you replace the parts of a string that match a pattern with a new string. The MakeReplacements example program, which is available for download and shown in Figure 21-3, replaces parts of a string that match a pattern with a new string.
Figure 21-3: The MakeReplacements example program replaces matching parts of a string with a new value.
The following code shows how the MakeReplacements program makes replacements.
// Make the replacements.
private void replaceButton_Click(object sender, EventArgs e)
{
    resultTextBox.Text = Regex.Replace(
        inputTextBox.Text,
        patternTextBox.Text,
        replaceWithTextBox.Text);
}
This code simply calls the Replace method, passing it the input string, the pattern to match, and the replacement text.
In some situations, you can use a Regex object to parse an input string by using capture groups. After matching a string, you can loop through a Regex object’s Matches collection to find parts of the string that matched the expression. You can then use each Match’s Groups collection, indexed by number or name (if the groups are named), to get the pieces of the match in the groups.
The ParsePhoneNumber example program, which is available for download and shown in Figure 21-4, uses a Regex object to find the pieces of a 10-digit phone number.
Figure 21-4: The ParsePhoneNumber example program parses out the pieces of a phone number.
The ParsePhoneNumber program uses the following code to find the phone number’s pieces.
// Find matching groups.
private void parseButton_Click(object sender, EventArgs e)
{
    groupsListBox.Items.Clear();

    Regex regex = new Regex(patternTextBox.Text);
    foreach (Match match in regex.Matches(inputTextBox.Text))
    {
        groupsListBox.Items.Add("NPA: " + match.Groups["NPA"]);
        groupsListBox.Items.Add("NXX: " + match.Groups["NXX"]);
        groupsListBox.Items.Add("XXXX: " + match.Groups["XXXX"]);
    }
}
This code clears its ListBox and then creates a Regex object, passing its constructor the pattern to match. It then calls the Matches method, passing it the input string, and loops through the resulting collection of matches.
If the input string contains a single phone number, there will be only one match in the Matches collection. In fact, the pattern shown in Figure 21-4 can contain at most one match.
For each match, the code uses the Groups collection to get the text in the named groups NPA, NXX, and XXXX. It adds the values it finds to the result ListBox.
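The pattern from Figure 21-4 is not reproduced in the text, so the following console sketch assumes one: the 10-digit pattern described earlier with each piece wrapped in a named group. The pattern, class, and method names here are assumptions for illustration, not the example program's actual code.

```csharp
using System;
using System.Text.RegularExpressions;

static class ParsePhoneDemo
{
    // Assumed pattern: the earlier NPA-NXX-XXXX pattern with named groups added.
    private const string PhonePattern =
        @"^(?<NPA>[2-9][0-8]\d)-(?<NXX>[2-9][0-9]{2})-(?<XXXX>\d{4})$";

    // Hypothetical helper: returns { NPA, NXX, XXXX } or null if there is no match.
    public static string[] ParsePhoneNumber(string text)
    {
        Match match = Regex.Match(text, PhonePattern);
        if (!match.Success) return null;

        return new string[]
        {
            match.Groups["NPA"].Value,
            match.Groups["NXX"].Value,
            match.Groups["XXXX"].Value
        };
    }

    static void Main()
    {
        string[] parts = ParsePhoneNumber("800-555-1337");
        Console.WriteLine("NPA: " + parts[0]);  // NPA: 800
        Console.WriteLine("NXX: " + parts[1]);  // NXX: 555
        Console.WriteLine("XXXX: " + parts[2]); // XXXX: 1337
    }
}
```

Because the assumed pattern is anchored with ^ and $, a single call to Match is enough here. Looping over Matches, as the example program does, matters when the input can contain several matches.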
The string class provides methods that you can use to examine a string to see if it matches a particular pattern. Unfortunately, using those methods can be a lot of work if the pattern is complicated.
The Regex class provides another approach that is often much easier. Instead of breaking a string apart to see if it matches a pattern, the Regex class lets you define a regular expression and then see whether the string satisfies the expression. The class’s methods also let you find parts of a string that match the expression, make replacements, and parse a string to find pieces that match parts of a regular expression.
Chapter 8, “LINQ,” described PLINQ, which lets you perform multiple LINQ queries in parallel. PLINQ isn’t the only way a program can perform multiple tasks at the same time. The following chapter explains other methods in which you can use parallelism to improve performance on multiple CPU or multicore systems.
Regex.Split to list the words in a string.
ListBox. Then run the program with the regular expression \w* and the input string abc;def;;;ghi;;. Why doesn’t the program display only the three strings abc, def, and ghi? How can you make it display only those strings?
ListBox as in NPA: 800, NXX: 555, XXXX: 1337.
Hints: When you create the Regex object, use the Multiline option. That makes the ^ and $ characters match the start and end of each line instead of the entire string. Also note that the Regex object considers [Newline] to mark the end of a line but a TextBox uses the [Return][Newline] combination to mark the end of a line. Modify the regular expression so it can remove the [Return] characters as needed.
Regex.Replace method use groups and other parts of a match in replacement text. For example, within the replacement text, you can use $1 to represent the text in the first match group.
Write a program that lets you type a series of names of the form Ann Archer. When you click the Rearrange button, make the program use replacement patterns to rewrite the names in the form Archer, Ann.
Hints: Remember to remove the trailing [Return] character from each line. Use replacement patterns to restore it in the result so each name appears on a separate line.