Regular expressions are a powerful language for describing and manipulating text. A regular expression is applied to a string, that is, to a sequence of characters. Often that string is an entire text document.
The result of applying a regular expression to a string is either to return a substring, or to return a new string representing a modification of some part of the original string. Remember that strings are immutable and so cannot be changed by the regular expression.
By applying a properly constructed regular expression to the following string:
One,Two,Three Liberty Associates, Inc.
you can return any or all of its substrings (e.g., Liberty or One), or modified versions of its substrings (e.g., LIBeRtY or OnE). What the regular expression does is determined by the syntax of the regular expression itself.
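As a rough sketch of both outcomes, the following program returns a substring and then a modified copy of the sample string. The class name SubstringDemo and its two helper methods are hypothetical names introduced for illustration; Regex.Match( ) and Regex.Replace( ) are the static methods doing the work.

```csharp
using System;
using System.Text.RegularExpressions;

public class SubstringDemo
{
    // Returns the first substring that matches the pattern,
    // or an empty string if nothing matches.
    public static string FindFirst(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    // Returns a NEW string with every match replaced; the input is not changed.
    public static string ReplaceAll(string input, string pattern, string replacement)
    {
        return Regex.Replace(input, pattern, replacement);
    }

    public static void Main()
    {
        string source = "One,Two,Three Liberty Associates, Inc.";

        Console.WriteLine(FindFirst(source, "Liberty"));
        Console.WriteLine(ReplaceAll(source, "Liberty", "LIBeRtY"));

        // The original string is immutable and still holds its first value.
        Console.WriteLine(source);
    }
}
```

Because strings are immutable, the last line prints the original text even after the replacement has been computed.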
A regular expression consists of two types of characters: literals and metacharacters. A literal is just a character you wish to match in the target string. A metacharacter is a special symbol that acts as a command to the regular expression parser. The parser is the engine responsible for understanding the regular expression. For example, if you create a regular expression:
^(From|To|Subject|Date):
this will match any substring with the letters "From" or the letters "To" or the letters "Subject" or the letters "Date" so long as those letters start a new line (^) and end with a colon (:).
The caret (^) in this case indicates to the regular expression parser that the string you’re searching for must begin a new line. The letters "From" and "To" are literals, and the metacharacters left and right parentheses ( (, ) ) and vertical bar (|) are all used to group sets of literals and indicate that any of the choices should match. (Note that ^ is a metacharacter as well, used to indicate the start of the line.)
Thus you would read this line:
^(From|To|Subject|Date):
as follows: “match any string that begins a new line followed by any of the four literal strings From, To, Subject, or Date followed by a colon.”
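The pattern just described can be tried out with a short program. This is only a sketch: the class HeaderDemo, the method FindHeaderNames, and the sample message text are assumptions made for illustration. One detail worth noting is that in .NET the caret matches only at the start of the whole string unless RegexOptions.Multiline is supplied, so the sketch passes that option to make ^ match at the start of every line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class HeaderDemo
{
    // Returns the header names (From, To, Subject, Date) that begin a line.
    public static List<string> FindHeaderNames(string text)
    {
        // Multiline makes ^ match at the start of each line,
        // not only at the start of the entire string.
        Regex headerRegex =
            new Regex("^(From|To|Subject|Date):", RegexOptions.Multiline);

        List<string> names = new List<string>();
        foreach (Match m in headerRegex.Matches(text))
        {
            // Groups[1] is the text matched inside the parentheses.
            names.Add(m.Groups[1].Value);
        }
        return names;
    }

    public static void Main()
    {
        // Hypothetical message; the third line has "From:" in mid-line,
        // so it does not match.
        string message =
            "From: jesse\nTo: reader\nNote: From: is ignored here\nSubject: regex";

        foreach (string name in FindHeaderNames(message))
        {
            Console.WriteLine(name);
        }
    }
}
```

The program prints From, To, and Subject; the "From:" on the third line is skipped because it does not begin the line.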
A full explanation of regular expressions is beyond the scope of this book, but all the regular expressions used in the examples are explained. For a complete understanding of regular expressions, I highly recommend Mastering Regular Expressions by Jeffrey E. F. Friedl (published by O’Reilly & Associates, Inc.).
The .NET Framework provides an object-oriented approach to regular expression matching and replacement.
C#’s regular expressions are based on Perl5 regexp, including lazy quantifiers (??, *?, +?, {n,m}?), positive and negative look ahead, and conditional evaluation.
The Base Class Library namespace System.Text.RegularExpressions is the home to all the .NET Framework objects associated with regular expressions. The central class for regular expression support is Regex, which represents an immutable, compiled regular expression. Although instances of Regex can be created, the class also provides a number of useful static methods. The use of Regex is illustrated in Example 10-5.
Example 10-5. Using the Regex class for regular expressions
namespace Programming_CSharp
{
    using System;
    using System.Text;
    using System.Text.RegularExpressions;

    public class Tester
    {
        static void Main( )
        {
            string s1 =
                "One, Two, Three Liberty Associates, Inc.";

            // split on a space, or on a comma followed by a space
            Regex theRegex = new Regex(" |, ");
            StringBuilder sBuilder = new StringBuilder( );
            int id = 1;

            foreach (string subString in theRegex.Split(s1))
            {
                sBuilder.AppendFormat(
                    "{0}: {1}\n", id++, subString);
            }
            Console.WriteLine("{0}", sBuilder);
        }
    }
}
Output:
1: One
2: Two
3: Three
4: Liberty
5: Associates
6: Inc.
Example 10-5 begins by creating a string, s1, identical to the string used in Example 10-4:
string s1 = "One, Two, Three Liberty Associates, Inc.";
and a regular expression, which will be used to search that string:
Regex theRegex = new Regex(" |, ");
One of the overloaded constructors for Regex takes a regular expression string as its parameter. This is a bit confusing. In the context of a C# program, which is the regular expression: the text passed in to the constructor, or the Regex object itself? It is true that the text string passed to the constructor is a regular expression in the traditional sense of the term. From an object-oriented C# point of view, however, the argument to the constructor is just a string of characters; it is theRegex that is the regular expression object.
The rest of the program proceeds like the earlier Example 10-4, except that rather than calling Split( ) on string s1, the Split( ) method of Regex is called. Regex.Split( ) acts in much the same way as String.Split( ), returning an array of strings as a result of matching the regular expression pattern within theRegex.
Regex.Split( ) is overloaded. The simplest version is called on an instance of Regex as shown in Example 10-5. There is also a static version of this method, which takes a string to search and the pattern to search with, as illustrated in Example 10-6.
Example 10-6. Using static Regex.Split( )
namespace Programming_CSharp
{
    using System;
    using System.Text;
    using System.Text.RegularExpressions;

    public class Tester
    {
        static void Main( )
        {
            string s1 =
                "One, Two, Three Liberty Associates, Inc.";
            StringBuilder sBuilder = new StringBuilder( );
            int id = 1;

            // static Split( ): no Regex instance is created
            foreach (string subStr in Regex.Split(s1," |, "))
            {
                sBuilder.AppendFormat("{0}: {1}\n", id++, subStr);
            }
            Console.WriteLine("{0}", sBuilder);
        }
    }
}
Example 10-6 is identical to Example 10-5, except that it does not instantiate an object of type Regex. Instead, Example 10-6 uses the static version of Split( ), which takes two arguments: a string to be searched and a regular expression string that represents the pattern to match.
The instance method Split( ) is also overloaded, with versions that limit the number of times the split will occur and that set the position within the target string where the search will begin.
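The two instance overloads just mentioned might be exercised as follows. This is a sketch only: the class SplitDemo and its helper methods are hypothetical, while Split(string, int) and Split(string, int, int) are the Regex instance overloads that take a maximum count and a starting position.

```csharp
using System;
using System.Text.RegularExpressions;

public class SplitDemo
{
    // Same pattern as Example 10-5: a space, or a comma followed by a space.
    private static readonly Regex theRegex = new Regex(" |, ");

    // Splits into at most 'count' substrings; the last one holds the remainder.
    public static string[] SplitAtMost(string input, int count)
    {
        return theRegex.Split(input, count);
    }

    // As above, but the search for delimiters begins at position 'startAt'.
    public static string[] SplitFrom(string input, int count, int startAt)
    {
        return theRegex.Split(input, count, startAt);
    }

    public static void Main()
    {
        string s1 = "One, Two, Three Liberty Associates, Inc.";

        foreach (string part in SplitAtMost(s1, 3))
        {
            Console.WriteLine("[{0}]", part);
        }

        foreach (string part in SplitFrom(s1, 2, 5))
        {
            Console.WriteLine("[{0}]", part);
        }
    }
}
```

With a count of 3, the first loop yields One, Two, and then the unsplit remainder of the string as a single element.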
Two additional classes in the .NET RegularExpressions namespace allow you to search a string repeatedly, and to return the results in a collection. The collection returned is of type MatchCollection, which consists of zero or more Match objects. Two important properties of a Match object are its length and its value, each of which can be read as illustrated in Example 10-7.
Example 10-7. Using MatchCollection and Match
namespace Programming_CSharp
{
    using System;
    using System.Text.RegularExpressions;

    class Test
    {
        public static void Main( )
        {
            string string1 = "This is a test string";

            // find any nonwhitespace followed by whitespace
            Regex theReg = new Regex(@"(\S+)\s");

            // get the collection of matches
            MatchCollection theMatches =
                theReg.Matches(string1);

            // iterate through the collection
            foreach (Match theMatch in theMatches)
            {
                Console.WriteLine(
                    "theMatch.Length: {0}", theMatch.Length);
                if (theMatch.Length != 0)
                {
                    Console.WriteLine("theMatch: {0}",
                        theMatch.ToString( ));
                }
            }
        }
    }
}
Output:
theMatch.Length: 5
theMatch: This
theMatch.Length: 3
theMatch: is
theMatch.Length: 2
theMatch: a
theMatch.Length: 5
theMatch: test
Example 10-7 creates a simple string to search:
string string1 = "This is a test string";
and a trivial regular expression to search it:
Regex theReg = new Regex(@"(\S+)\s");
The string \S finds nonwhitespace, and the plus sign indicates one or more. The string \s (note lowercase) indicates whitespace. Thus, together, this string looks for any nonwhitespace characters followed by whitespace.
Remember that the at (@) symbol before the string creates a verbatim string, which avoids the necessity of escaping the backslash (\) character.
The output shows that the first four words were found. The final word was not found because it is not followed by a space. If you insert a space after the word string and before the closing quote marks, this program will find that word as well.
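Appending a space is one way to pick up the last word. Another possibility, sketched here as an assumption rather than as the approach taken in Example 10-7, is to let the pattern accept either whitespace or the end of the string, and to read the word itself from the first group so that the trailing whitespace is left out. The class WordDemo and the method FindWords are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class WordDemo
{
    // Returns each run of nonwhitespace that is followed by
    // whitespace or by the end of the string.
    public static List<string> FindWords(string input)
    {
        // $ matches at the end of the string, so the final word qualifies too.
        Regex theReg = new Regex(@"(\S+)(\s|$)");

        List<string> words = new List<string>();
        foreach (Match theMatch in theReg.Matches(input))
        {
            // Groups[1] holds only the nonwhitespace part of the match.
            words.Add(theMatch.Groups[1].Value);
        }
        return words;
    }

    public static void Main()
    {
        foreach (string word in FindWords("This is a test string"))
        {
            Console.WriteLine("word: {0} (length {1})", word, word.Length);
        }
    }
}
```

Here all five words are reported, and the lengths no longer include the trailing space that was counted in the output of Example 10-7.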
The Length property is the length of the captured substring, and is discussed in Section 10.2.4, later in this chapter.
It is often convenient to group subexpression matches together so that you can parse out pieces of the matching string. For example, you might want to match on IP addresses and group all IP addresses found anywhere within the string.
IP addresses are used to locate computers on a network, and typically take the form of four numbers, each between 0 and 255, separated by dots (for example, 192.168.0.1).
The Group class allows you to create groups of matches based on regular expression syntax, and represents the results from a single grouping expression.
A grouping expression names a group and provides a regular expression; any substring matching the regular expression will be added to the group. For example, to create an ip group you might write:
@"(?<ip>(\d|\.)+)\s"
The Match class derives from Group, and has a collection called Groups that contains all the groups your Match finds.
Creation and use of the Groups collection and Group classes is illustrated in Example 10-8.
Example 10-8. Using the Group class
namespace Programming_CSharp
{
    using System;
    using System.Text.RegularExpressions;

    class Test
    {
        public static void Main( )
        {
            string string1 = "04:03:27 127.0.0.0 LibertyAssociates.com";

            // group time = one or more digits or colons followed by space
            Regex theReg = new Regex(@"(?<time>(\d|:)+)\s" +
                // ip address = one or more digits or dots followed by space
                @"(?<ip>(\d|\.)+)\s" +
                // site = one or more characters
                @"(?<site>\S+)");

            // get the collection of matches
            MatchCollection theMatches = theReg.Matches(string1);

            // iterate through the collection
            foreach (Match theMatch in theMatches)
            {
                if (theMatch.Length != 0)
                {
                    Console.WriteLine("\ntheMatch: {0}",
                        theMatch.ToString( ));
                    Console.WriteLine("time: {0}",
                        theMatch.Groups["time"]);
                    Console.WriteLine("ip: {0}",
                        theMatch.Groups["ip"]);
                    Console.WriteLine("site: {0}",
                        theMatch.Groups["site"]);
                }
            }
        }
    }
}
Again, Example 10-8 begins by creating a string to search:
string string1 = "04:03:27 127.0.0.0 LibertyAssociates.com";
This string might be one of many recorded in a web server log file or produced as the result of a search of the database. In this simple example there are three columns: one for the time of the log entry, one for an IP address, and one for the site, each separated by spaces. Of course, in a real example solving a real-life problem, you might need to do more complex searches and choose to use other delimiters.
In Example 10-8, we want to create a single Regex object to search strings of this type and break them into three groups: time, ip address, and site. The regular expression string is fairly simple, so the example is easy to understand (however, keep in mind that in a real search, you would probably only use a part of the source string rather than the entire source string, as shown here):
// group time = one or more digits or colons
// followed by space
Regex theReg = new Regex(@"(?<time>(\d|:)+)\s" +
    // ip address = one or more digits or dots
    // followed by space
    @"(?<ip>(\d|\.)+)\s" +
    // site = one or more characters
    @"(?<site>\S+)");
Let’s focus on the characters that create the group:
(?<time>
The parentheses create a group. Everything between the opening parenthesis (just before the question mark) and the closing parenthesis (in this case, after the + sign) is a single group, which would be unnamed without the ?<time> prefix:
(?<time>(\d|:)+)
The string ?<time> names that group time, and the group is associated with the text matched by the regular expression (\d|:)+, which is followed in the pattern by \s. Together, these can be interpreted as “one or more digits or colons followed by a space.”
Similarly, the string ?<ip> names the ip group, and ?<site> names the site group. As Example 10-7 does, Example 10-8 asks for a collection of all the matches:
MatchCollection theMatches = theReg.Matches(string1);
Example 10-8 iterates through the Matches collection, finding each Match object. If the Length of the Match is greater than 0, a Match was found, and the program prints the entire match:
Console.WriteLine("\ntheMatch: {0}", theMatch.ToString( ));
Here’s the output:
theMatch: 04:03:27 127.0.0.0 LibertyAssociates.com
It then gets the “time” group from the Match's Groups collection and prints that value:
Console.WriteLine("time: {0}", theMatch.Groups["time"]);
This produces the output:
time: 04:03:27
The code then obtains ip and site groups:
Console.WriteLine("ip: {0}", theMatch.Groups["ip"]);
Console.WriteLine("site: {0}", theMatch.Groups["site"]);
This produces the output:
ip: 127.0.0.0
site: LibertyAssociates.com
In Example 10-8, the Matches collection has only one Match. It is possible, however, to match more than one expression within a string. To see this, modify string1 in Example 10-8 to provide several log file entries instead of one, as follows:
string string1 = "04:03:27 127.0.0.0 LibertyAssociates.com " +
    "04:03:28 127.0.0.0 foo.com " +
    "04:03:29 127.0.0.0 bar.com " ;
This creates three matches in the MatchCollection, theMatches. Here’s the resulting output:
theMatch: 04:03:27 127.0.0.0 LibertyAssociates.com
time: 04:03:27
ip: 127.0.0.0
site: LibertyAssociates.com

theMatch: 04:03:28 127.0.0.0 foo.com
time: 04:03:28
ip: 127.0.0.0
site: foo.com

theMatch: 04:03:29 127.0.0.0 bar.com
time: 04:03:29
ip: 127.0.0.0
site: bar.com
In this example, theMatches contains three Match objects. Each time through the outer foreach loop we find the next Match in the collection and display its contents:
foreach (Match theMatch in theMatches)
For each of the Match items found, you can print out the entire match, various groups, or both.
Each time a Regex object matches a subexpression, a Capture instance is created and added to a CaptureCollection collection. Each capture object represents a single capture. Each group has its own capture collection of the matches for the subexpression associated with the group.
A key property of the Capture object is its Length, which is the length of the captured substring. When you ask Match for its length, it is Capture.Length that you retrieve because Match derives from Group, which in turn derives from Capture.
The regular expression inheritance scheme in .NET allows Match to include in its interface the methods and properties of these parent classes. In a sense, a Group is-a Capture: it is a capture that encapsulates the idea of grouping subexpressions. A Match, in turn, is-a Group: it is the encapsulation of all the groups of subexpressions making up the entire match for this regular expression. (See Chapter 5 for more about the is-a relationship and other relationships.)
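The inheritance chain can be observed directly, because a Match can be assigned to a Group variable and then to a Capture variable without a cast. The following is a minimal sketch; the class InheritanceDemo and the method FirstMatchAsCapture are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public class InheritanceDemo
{
    // Finds the first match and hands it back typed as its base class, Capture.
    public static Capture FirstMatchAsCapture(string input, string pattern)
    {
        Match theMatch = Regex.Match(input, pattern);

        // No casts are needed: Match derives from Group,
        // and Group derives from Capture.
        Group asGroup = theMatch;
        Capture asCapture = asGroup;
        return asCapture;
    }

    public static void Main()
    {
        Capture cap = FirstMatchAsCapture("This is a test string", @"(\S+)\s");

        // Length, Index, and Value are all declared on Capture.
        Console.WriteLine("Length: {0}", cap.Length);
        Console.WriteLine("Index: {0}", cap.Index);
        Console.WriteLine("Value: '{0}'", cap.Value);
    }
}
```

The Length reported through the Capture reference is the same value that Example 10-7 read from the Match itself.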
Typically, you will find only a single Capture in a CaptureCollection; but that need not be so. Consider what would happen if you were parsing a string in which the company name might occur in either of two positions. To group these together in a single match you create the ?<company> group in two places in your regular expression pattern:
Regex theReg = new Regex(@"(?<time>(\d|:)+)\s" +
    @"(?<company>\S+)\s" +
    @"(?<ip>(\d|\.)+)\s" +
    @"(?<company>\S+)\s");
This regular expression group captures any matching string of characters that follows time, and also any matching string of characters that follows ip.
Given this regular expression, you are ready to parse the following string:
string string1 = "04:03:27 Jesse 0.0.0.127 Liberty ";
The string includes names in both the positions specified. Here is the result:
theMatch: 04:03:27 Jesse 0.0.0.127 Liberty
time: 04:03:27
ip: 0.0.0.127
Company: Liberty
What happened? Why is the Company group showing Liberty? Where is the first term, which also matched? The answer is that the second term overwrote the first. The group, however, has captured both; and its Captures collection can show that to you, as illustrated in Example 10-9.
Example 10-9. Examining the capture collection
namespace Programming_CSharp
{
    using System;
    using System.Text.RegularExpressions;

    class Test
    {
        public static void Main( )
        {
            // the string to parse
            // note that names appear in both
            // searchable positions
            string string1 =
                "04:03:27 Jesse 0.0.0.127 Liberty ";

            // regular expression which groups company twice
            Regex theReg = new Regex(@"(?<time>(\d|:)+)\s" +
                @"(?<company>\S+)\s" +
                @"(?<ip>(\d|\.)+)\s" +
                @"(?<company>\S+)\s");

            // get the collection of matches
            MatchCollection theMatches =
                theReg.Matches(string1);

            // iterate through the collection
            foreach (Match theMatch in theMatches)
            {
                if (theMatch.Length != 0)
                {
                    Console.WriteLine("theMatch: {0}",
                        theMatch.ToString( ));
                    Console.WriteLine("time: {0}",
                        theMatch.Groups["time"]);
                    Console.WriteLine("ip: {0}",
                        theMatch.Groups["ip"]);
                    Console.WriteLine("Company: {0}",
                        theMatch.Groups["company"]);

                    // iterate over the captures collection
                    // in the company group within the
                    // groups collection in the match
                    foreach (Capture cap in
                        theMatch.Groups["company"].Captures)
                    {
                        Console.WriteLine("cap: {0}",cap.ToString( ));
                    }
                }
            }
        }
    }
}
Output:
theMatch: 04:03:27 Jesse 0.0.0.127 Liberty
time: 04:03:27
ip: 0.0.0.127
Company: Liberty
cap: Jesse
cap: Liberty
The inner foreach loop iterates through the Captures collection for the Company group:
foreach (Capture cap in theMatch.Groups["company"].Captures)
Let’s review how this line is parsed. The compiler begins by finding the collection that it will iterate over. theMatch is an object that has a collection named Groups. The Groups collection has an indexer that takes a string and returns a single Group object. Thus, the following line returns a single Group object:
theMatch.Groups["company"]
The Group object has a collection named Captures. Thus, the following line returns a Captures collection for the Group stored at Groups["company"] within the theMatch object:
theMatch.Groups["company"].Captures
The foreach loop iterates over the Captures collection, extracting each element in turn and assigning it to the local variable cap, which is of type Capture. You can see from the output that there are two capture elements: Jesse and Liberty. The second one overwrites the first in the group, and so the displayed value is just Liberty, but by examining the Captures collection you can find both values that were captured.
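Each Capture also records where in the source string it was found, which can help tell the two company captures apart. The following sketch reuses the pattern and input from Example 10-9; the class CaptureDemo and the method DescribeCaptures are hypothetical names, and the report format is an assumption made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class CaptureDemo
{
    // Describes every capture recorded for the company group of the first match.
    public static List<string> DescribeCaptures(string input)
    {
        // Same pattern as Example 10-9: the company group appears twice.
        Regex theReg = new Regex(@"(?<time>(\d|:)+)\s" +
            @"(?<company>\S+)\s" +
            @"(?<ip>(\d|\.)+)\s" +
            @"(?<company>\S+)\s");

        List<string> lines = new List<string>();
        Match theMatch = theReg.Match(input);
        if (theMatch.Success)
        {
            foreach (Capture cap in theMatch.Groups["company"].Captures)
            {
                // Index is the zero-based position of the capture in the input.
                lines.Add(string.Format("{0} (index {1}, length {2})",
                    cap.Value, cap.Index, cap.Length));
            }
        }
        return lines;
    }

    public static void Main()
    {
        string string1 = "04:03:27 Jesse 0.0.0.127 Liberty ";
        foreach (string line in DescribeCaptures(string1))
        {
            Console.WriteLine(line);
        }
    }
}
```

The captures are listed in the order in which they were matched, so Jesse appears before Liberty even though the group itself reports only Liberty.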