There was a time when people thought of computers exclusively as manipulating numeric values. Early computers were first used to calculate missile trajectories (though recently declassified documents suggest that some were used for code-breaking as well). In any case, there was a time when programming was taught in the math department of major universities and computer science was considered a discipline of mathematics.
Today, most programs are concerned more with strings of characters than with strings of numbers. Typically these strings are used for word processing, document manipulation, and creation of web pages.
C# provides built-in support for a fully functional string type. More importantly, C# treats strings as objects that encapsulate all the manipulation, sorting, and searching methods normally applied to strings of characters.
Complex string manipulation and pattern-matching are aided by the use of regular expressions. C# combines the power and complexity of regular expression syntax, originally found only in string manipulation languages such as awk and Perl, with a fully object-oriented design.
In this chapter, you will learn to work with the C# string type and the .NET Framework System.String class that it aliases. You will see how to extract substrings, manipulate and concatenate strings, and build new strings with the StringBuilder class. In addition, you will learn how to use the Regex class to match strings based on complex regular expressions.
C# treats strings as first-class types that are flexible, powerful, and easy to use.
In C# programming you typically use the C# alias for a Framework type (e.g., int for Int32) but you are always free to use the underlying type. C# programmers thus use string (lowercase) and the underlying Framework type String (uppercase) interchangeably.
A simplified declaration of the String class is:
public sealed class String : IComparable<string>, ICloneable, IConvertible, IEnumerable<char>
This declaration reveals that the class is sealed, meaning that it is not possible to derive from the String class. The class also implements several system interfaces, including IComparable<string>, ICloneable, IConvertible, and IEnumerable<char>, that dictate functionality that String shares with other classes in the .NET Framework.
Each string object is an immutable sequence of Unicode characters. The fact that String is immutable means that methods that appear to change the string actually return a modified copy; the original string remains intact in memory until it is garbage-collected. This may have performance implications; if you plan to do significant repeated string manipulation, use a StringBuilder (described later).
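To make the point about immutability concrete, here is a minimal sketch. The class name ImmutabilityDemo and the helper Shout are hypothetical names chosen for illustration; ToUpper( ) is one of the String methods listed later in this chapter.

```csharp
using System;

public static class ImmutabilityDemo
{
    // Hypothetical helper: returns an uppercase copy of the argument.
    // The argument itself is never modified.
    public static string Shout(string input)
    {
        return input.ToUpper();
    }

    public static void Main()
    {
        string original = "abcd";
        string shouted = Shout(original);

        Console.WriteLine(original); // abcd (unchanged)
        Console.WriteLine(shouted);  // ABCD (a new string)

        // The two variables refer to different objects in memory.
        Console.WriteLine(Object.ReferenceEquals(original, shouted)); // False
    }
}
```

The method that appears to change the string leaves the original untouched and hands back a second string object.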
As seen in Chapter 9, the IComparable<T> interface is implemented by types whose values can be ordered. Strings, for example, can be alphabetized; any given string can be compared with another string to determine which should come first in an ordered list. IComparable classes implement the CompareTo method.
IEnumerable, also discussed in Chapter 9, lets you use the foreach construct to enumerate a string as a collection of chars.
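As a small illustration of enumerating a string with foreach, the following sketch counts the uppercase letters in a string. The class and method names are hypothetical; char.IsUpper( ) is a standard Framework method.

```csharp
using System;

public static class EnumerateChars
{
    // Counts uppercase letters by visiting the string one char at a time.
    public static int CountUpper(string text)
    {
        int count = 0;
        foreach (char c in text)
        {
            if (char.IsUpper(c))
            {
                count++;
            }
        }
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountUpper("One Two Three Four")); // 4
    }
}
```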
ICloneable objects can create new instances with the same value as the original instance. In this case, it is possible to clone a string to produce a new string with the same values (characters) as the original. ICloneable classes implement the Clone( ) method.
Actually, because strings are immutable, the Clone( ) method on String just returns a reference to the original string. If you change the cloned string, a new String is then created:
string s1 = "One Two Three Four";
string sx = (string)s1.Clone();
Console.WriteLine(Object.ReferenceEquals(s1, sx));
sx += " Five";
Console.WriteLine(Object.ReferenceEquals(s1, sx));
Console.WriteLine(sx);
In this case, sx is created as a clone of s1. The first WriteLine statement prints the word True; the two string variables refer to the same string in memory. When you change sx you actually create a new string from the first, so the second ReferenceEquals call returns False, and the final WriteLine statement prints the contents of the original string with the word “Five” appended.
IConvertible classes provide methods to facilitate conversion to other primitive types such as ToInt32( ), ToDouble( ), ToDecimal( ), etc.
The most common way to create a string is to assign a quoted string of characters, known as a string literal, to a user-defined variable of type string:
string newString = "This is a string literal";
Quoted strings can include escape characters, such as \n or \t, which begin with a backslash character (\). The two shown are used to indicate where line breaks or tabs are to appear, respectively.
Because the backslash is the escape character, if you want to put a backslash into a string (e.g., to create a path listing), you must quote the backslash with a second backslash (\\).
Strings can also be created using verbatim string literals, which start with the (@) symbol. This tells the String constructor that the string should be used verbatim, even if it spans multiple lines or includes escape characters. In a verbatim string literal, backslashes and the characters that follow them are simply considered additional characters of the string. Thus, the following two definitions are equivalent:
string literalOne = "\\\\MySystem\\MyDirectory\\ProgrammingC#.cs";
string verbatimLiteralOne = @"\\MySystem\MyDirectory\ProgrammingC#.cs";
In the first line, a nonverbatim string literal is used, and so the backslash character (\) must be escaped. This means it must be preceded by a second backslash character. In the second line, a verbatim literal string is used, so the extra backslash is not needed. A second example illustrates multiline verbatim strings:
string literalTwo = "Line One\nLine Two";
string verbatimLiteralTwo = @"Line One
Line Two";
Again, these declarations are interchangeable. Which one you use is a matter of convenience and personal style.
If you have double quotes within a verbatim string, you must escape them by doubling them ("") so that the compiler knows when the verbatim string ends.
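The two escaping styles can be compared side by side. This is a minimal sketch; the class name, the constant names, and the file path are made up for illustration. In a regular literal a quote is written \" and a backslash \\, while in a verbatim literal a quote is written "" and a backslash needs no escaping.

```csharp
using System;

public static class LiteralDemo
{
    // Regular literal: escape quotes as \" and backslashes as \\.
    public const string Regular = "\"C:\\Temp\\notes.txt\"";

    // Verbatim literal: double the quotes, leave backslashes alone.
    public const string Verbatim = @"""C:\Temp\notes.txt""";

    public static void Main()
    {
        Console.WriteLine(Verbatim);            // "C:\Temp\notes.txt"
        Console.WriteLine(Regular == Verbatim); // True
    }
}
```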
Another common way to create a string is to call the ToString( ) method on an object and assign the result to a string variable. All the built-in types override this method to simplify the task of converting a value (often a numeric value) to a string representation of that value. In the following example, the ToString( ) method of an integer type is called to store its value in a string:
int myInteger = 5; string integerString = myInteger.ToString();
The call to myInteger.ToString( ) returns a String object, which is then assigned to integerString.
The .NET String class provides a wealth of overloaded constructors that support a variety of techniques for assigning string values to string types. Some of these constructors enable you to create a string by passing in a character array or character pointer. Passing in a character array as a parameter to the constructor of the String creates a CLR-compliant new instance of a string. Passing in a character pointer requires the unsafe marker as explained in Chapter 22.
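A few of the safe constructor overloads can be sketched as follows. The class and helper names are hypothetical; the three overloads used (a char array, a slice of a char array, and a repeated character) are standard String constructors.

```csharp
using System;

public static class ConstructorDemo
{
    // Builds a string from an entire character array.
    public static string FromChars(char[] letters)
    {
        return new string(letters);
    }

    public static void Main()
    {
        char[] letters = { 'a', 'b', 'c', 'd' };

        Console.WriteLine(FromChars(letters));        // abcd
        // Start at index 1 and take 2 characters.
        Console.WriteLine(new string(letters, 1, 2)); // bc
        // Repeat a single character 5 times.
        Console.WriteLine(new string('-', 5));        // -----
    }
}
```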
The string class provides a host of methods for comparing, searching, and manipulating strings, the most important of which are shown in Table 10-1.
Table 10-1. Methods and fields for the string class
Method or field | Purpose
---|---
Empty | Public static field that represents the empty string.
Compare() | Overloaded public static method that compares two strings.
CompareOrdinal() | Overloaded public static method that compares two strings without regard to locale or culture.
Concat() | Overloaded public static method that creates a new string from one or more strings.
Copy() | Public static method that creates a new string by copying another.
Equals() | Overloaded public static and instance method that determines if two strings have the same value.
Format() | Overloaded public static method that formats a string using a format specification.
Join() | Overloaded public static method that concatenates a specified string between each element of a string array.
Chars | The string indexer.
Length | The number of characters in the instance.
CompareTo() | Compares this string with another.
CopyTo() | Copies the specified number of characters to an array of Unicode characters.
EndsWith() | Indicates whether the specified string matches the end of this string.
Equals() | Determines if two strings have the same value.
Insert() | Returns a new string with the specified string inserted.
LastIndexOf() | Reports the index of the last occurrence of a specified character or string within the string.
PadLeft() | Right-aligns the characters in the string, padding to the left with spaces or a specified character.
PadRight() | Left-aligns the characters in the string, padding to the right with spaces or a specified character.
Remove() | Deletes the specified number of characters.
Split() | Returns the substrings delimited by the specified characters in a string array.
StartsWith() | Indicates if the string starts with the specified characters.
Substring() | Retrieves a substring.
ToCharArray() | Copies the characters from the string to a character array.
ToLower() | Returns a copy of the string in lowercase.
ToUpper() | Returns a copy of the string in uppercase.
Trim() | Removes all occurrences of a set of specified characters from beginning and end of the string.
TrimEnd() | Behaves like Trim(), but only at the end.
TrimStart() | Behaves like Trim(), but only at the start.
Example 10-1 illustrates the use of some of these methods, including Compare( ), Concat( ) (and the overloaded + operator), Copy( ) (and the = operator), Insert( ), EndsWith( ), and IndexOf( ).
Example 10-1. Working with strings
#region Using directives
using System;
using System.Collections.Generic;
using System.Text;
#endregion

namespace WorkingWithStrings
{
    public class StringTester
    {
        static void Main( )
        {
            // create some strings to work with
            string s1 = "abcd";
            string s2 = "ABCD";
            string s3 = @"Liberty Associates, Inc.
                provides custom .NET development,
                on-site Training and Consulting";

            int result; // hold the results of comparisons

            // compare two strings, case sensitive
            result = string.Compare( s1, s2 );
            Console.WriteLine( "compare s1: {0}, s2: {1}, result: {2}\n", s1, s2, result );

            // overloaded compare, takes boolean "ignore case"
            // (true = ignore case)
            result = string.Compare( s1, s2, true );
            Console.WriteLine( "compare insensitive\n" );
            Console.WriteLine( "s1: {0}, s2: {1}, result: {2}\n", s1, s2, result );

            // concatenation method
            string s6 = string.Concat( s1, s2 );
            Console.WriteLine( "s6 concatenated from s1 and s2: {0}", s6 );

            // use the overloaded operator
            string s7 = s1 + s2;
            Console.WriteLine( "s7 concatenated from s1 + s2: {0}", s7 );

            // the string copy method
            string s8 = string.Copy( s7 );
            Console.WriteLine( "s8 copied from s7: {0}", s8 );

            // use the overloaded operator
            string s9 = s8;
            Console.WriteLine( "s9 = s8: {0}", s9 );

            // three ways to compare.
            Console.WriteLine( "\nDoes s9.Equals(s8)?: {0}", s9.Equals( s8 ) );
            Console.WriteLine( "Does Equals(s9,s8)?: {0}", string.Equals( s9, s8 ) );
            Console.WriteLine( "Does s9==s8?: {0}", s9 == s8 );

            // Two useful properties: the index and the length
            Console.WriteLine( "\nString s9 is {0} characters long.\n", s9.Length );
            Console.WriteLine( "The 5th character is {1}\n", s9.Length, s9[4] );

            // test whether a string ends with a set of characters
            Console.WriteLine( "s3:{0}\nEnds with Training?: {1}\n", s3, s3.EndsWith( "Training" ) );
            Console.WriteLine( "Ends with Consulting?: {0}", s3.EndsWith( "Consulting" ) );

            // return the index of the substring
            Console.WriteLine( "\nThe first occurrence of Training " );
            Console.WriteLine( "in s3 is {0}\n", s3.IndexOf( "Training" ) );

            // insert the word excellent before "Training"
            // (101 is the offset reported by IndexOf above; it depends
            // on the exact whitespace inside the verbatim literal s3)
            string s10 = s3.Insert( 101, "excellent " );
            Console.WriteLine( "s10: {0}\n", s10 );

            // you can combine the two as follows:
            string s11 = s3.Insert( s3.IndexOf( "Training" ), "excellent " );
            Console.WriteLine( "s11: {0}\n", s11 );
        }
    }
}

Output:
compare s1: abcd, s2: ABCD, result: -1

compare insensitive

s1: abcd, s2: ABCD, result: 0

s6 concatenated from s1 and s2: abcdABCD
s7 concatenated from s1 + s2: abcdABCD
s8 copied from s7: abcdABCD
s9 = s8: abcdABCD

Does s9.Equals(s8)?: True
Does Equals(s9,s8)?: True
Does s9==s8?: True

String s9 is 8 characters long.

The 5th character is A

s3:Liberty Associates, Inc.
                provides custom .NET development,
                on-site Training and Consulting
Ends with Training?: False

Ends with Consulting?: True

The first occurrence of Training
in s3 is 101

s10: Liberty Associates, Inc.
                provides custom .NET development,
                on-site excellent Training and Consulting

s11: Liberty Associates, Inc.
                provides custom .NET development,
                on-site excellent Training and Consulting
Example 10-1 begins by declaring three strings:
string s1 = "abcd";
string s2 = "ABCD";
string s3 = @"Liberty Associates, Inc.
                provides custom .NET development,
                on-site Training and Consulting";
The first two are string literals, and the third is a verbatim string literal. We begin by comparing s1 to s2. The Compare( ) method is a public static method of string, and it is overloaded. The first overloaded version takes two strings and compares them:
// compare two strings, case sensitive
result = string.Compare(s1, s2);
Console.WriteLine("compare s1: {0}, s2: {1}, result: {2}\n", s1, s2, result);
This is a case-sensitive comparison and returns different values, depending on the results of the comparison:
A negative integer, if the first string is less than the second string
0, if the strings are equal
A positive integer, if the first string is greater than the second string
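In this case, the output properly indicates that s1 is “less than” s2. Compare( ) performs a culture-sensitive comparison, in which a lowercase letter sorts before its uppercase counterpart (even though, in Unicode as in ASCII, a lowercase letter has a larger numeric value than the corresponding uppercase letter):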
compare s1: abcd, s2: ABCD, result: -1
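The difference between a linguistic comparison and a comparison of raw character values can be seen by placing Compare( ) next to CompareOrdinal( ) from Table 10-1. This is a minimal sketch: the class name is hypothetical, and passing StringComparison.InvariantCulture is an assumption made here so that the result does not depend on the machine's current culture.

```csharp
using System;

public static class CompareDemo
{
    public static void Main()
    {
        string s1 = "abcd";
        string s2 = "ABCD";

        // Linguistic comparison: lowercase sorts before uppercase.
        int linguistic = string.Compare(s1, s2, StringComparison.InvariantCulture);

        // Ordinal comparison: compares numeric character values,
        // and 'a' (97) is greater than 'A' (65).
        int ordinal = string.CompareOrdinal(s1, s2);

        // Negative with default globalization settings.
        Console.WriteLine(Math.Sign(linguistic));
        Console.WriteLine(Math.Sign(ordinal)); // 1
    }
}
```

Only the sign of the result is meaningful; the exact magnitude returned by either method should not be relied upon.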
The second comparison uses an overloaded version of Compare( ) that takes a third, Boolean parameter, whose value determines whether case should be ignored in the comparison. If the value of this “ignore case” parameter is true, the comparison is made without regard to case, as in the following:
result = string.Compare(s1, s2, true);
Console.WriteLine("compare insensitive\n");
Console.WriteLine("s1: {0}, s2: {1}, result: {2}\n", s1, s2, result);
The result is written with two WriteLine( ) statements to keep the lines short enough to print properly in this book.
This time the case is ignored and the result is 0, indicating that the two strings are identical (without regard to case):
compare insensitive
s1: abcd, s2: ABCD, result: 0
Example 10-1 then concatenates some strings. There are a couple of ways to accomplish this. You can use the Concat( ) method, which is a static public method of string:
string s6 = string.Concat(s1,s2);
or you can simply use the overloaded concatenation (+) operator:
string s7 = s1 + s2;
In both cases, the output reflects that the concatenation was successful:
s6 concatenated from s1 and s2: abcdABCD
s7 concatenated from s1 + s2: abcdABCD
Similarly, creating a new copy of a string can be accomplished in two ways. First, you can use the static Copy( ) method:
string s8 = string.Copy(s7);
This actually creates two separate strings with the same values. Since strings are immutable, this is wasteful. Better is either to use the overloaded assignment operator or the Clone method (mentioned earlier), both of which leave you with two variables pointing to the same string in memory:
string s9 = s8;
The .NET String class provides three ways to test for the equality of two strings. First, you can use the overloaded Equals( ) method and ask s9 directly whether s8 is of equal value:
Console.WriteLine("\nDoes s9.Equals(s8)?: {0}", s9.Equals(s8));
A second technique is to pass both strings to String’s static method Equals( ):
Console.WriteLine("Does Equals(s9,s8)?: {0}", string.Equals(s9,s8));
A final method is to use the equality operator (==) of String:
Console.WriteLine("Does s9==s8?: {0}", s9 == s8);
In each case, the returned result is a Boolean value, as shown in the output:
Does s9.Equals(s8)?: True
Does Equals(s9,s8)?: True
Does s9==s8?: True
The next several lines in Example 10-1 use the index operator ([]) to find a particular character within a string, and use the Length property to return the length of the entire string:
Console.WriteLine("\nString s9 is {0} characters long.\n", s9.Length);
Console.WriteLine("The 5th character is {1}\n", s9.Length, s9[4]);
Here’s the output:
String s9 is 8 characters long.

The 5th character is A
The EndsWith( ) method asks a string whether a substring is found at the end of the string. Thus, you might first ask s3 if it ends with Training (which it doesn’t) and then if it ends with Consulting (which it does):
// test whether a string ends with a set of characters
Console.WriteLine("s3:{0}\nEnds with Training?: {1}\n", s3, s3.EndsWith("Training") );
Console.WriteLine("Ends with Consulting?: {0}", s3.EndsWith("Consulting"));
The output reflects that the first test fails and the second succeeds:
s3:Liberty Associates, Inc.
                provides custom .NET development,
                on-site Training and Consulting
Ends with Training?: False

Ends with Consulting?: True
The IndexOf( ) method locates a substring within our string, and the Insert( ) method inserts a new substring into a copy of the original string.
The following code locates the first occurrence of Training in s3:
Console.WriteLine("\nThe first occurrence of Training ");
Console.WriteLine("in s3 is {0}\n", s3.IndexOf("Training"));
The output indicates that the offset is 101:
The first occurrence of Training
in s3 is 101
You can then use that value to insert the word excellent, followed by a space, into that string. Actually, the insertion is into a copy of the string returned by the Insert( ) method and assigned to s10:
string s10 = s3.Insert(101,"excellent ");
Console.WriteLine("s10: {0}\n",s10);
Here’s the output:
s10: Liberty Associates, Inc.
                provides custom .NET development,
                on-site excellent Training and Consulting
Finally, you can combine these operations:
string s11 = s3.Insert(s3.IndexOf("Training"),"excellent ");
Console.WriteLine("s11: {0}\n",s11);
to obtain the identical output:
s11: Liberty Associates, Inc.
                provides custom .NET development,
                on-site excellent Training and Consulting
The String type provides an overloaded Substring( ) method for extracting substrings from within strings. Both versions take an index indicating where to begin the extraction, and one of the two versions takes a second argument giving the number of characters to extract. The Substring( ) method is illustrated in Example 10-2.
Example 10-2. Using the Substring() method
#region Using directives
using System;
using System.Collections.Generic;
using System.Text;
#endregion

namespace SubString
{
    public class StringTester
    {
        static void Main( )
        {
            // create some strings to work with
            string s1 = "One Two Three Four";
            int ix;

            // get the index of the last space
            ix = s1.LastIndexOf( " " );

            // get the last word.
            string s2 = s1.Substring( ix + 1 );

            // set s1 to the substring starting at 0
            // and ending at ix (the start of the last word)
            // thus s1 has "One Two Three"
            s1 = s1.Substring( 0, ix );

            // find the last space in s1 (after "Two")
            ix = s1.LastIndexOf( " " );

            // set s3 to the substring starting at
            // ix, the space after "Two", plus one more
            // thus s3 = "Three"
            string s3 = s1.Substring( ix + 1 );

            // reset s1 to the substring starting at 0
            // and ending at ix, thus the string "One Two"
            s1 = s1.Substring( 0, ix );

            // reset ix to the space between "One" and "Two"
            ix = s1.LastIndexOf( " " );

            // set s4 to the substring starting one
            // character after ix, thus the substring "Two"
            string s4 = s1.Substring( ix + 1 );

            // reset s1 to the substring starting at 0
            // and ending at ix, thus "One"
            s1 = s1.Substring( 0, ix );

            // set ix to the last space, but there is
            // none so ix now = -1
            ix = s1.LastIndexOf( " " );

            // set s5 to the substring at one past
            // the last space. there was no last space
            // so this sets s5 to the substring starting at zero
            string s5 = s1.Substring( ix + 1 );

            Console.WriteLine( "s2: {0}\ns3: {1}", s2, s3 );
            Console.WriteLine( "s4: {0}\ns5: {1}\n", s4, s5 );
            Console.WriteLine( "s1: {0}\n", s1 );
        }
    }
}

Output:
s2: Four
s3: Three
s4: Two
s5: One

s1: One
Example 10-2 is not an elegant solution to the problem of extracting words from a string, but it is a good first approximation, and it illustrates a useful technique. The example begins by creating a string, s1:
string s1 = "One Two Three Four";
Then ix is assigned the index of the last space in the string:
ix=s1.LastIndexOf(" ");
Then the substring that begins one character later is assigned to the new string, s2:
string s2 = s1.Substring(ix+1);
This extracts the characters from ix+1 to the end of the line, assigning to s2 the value Four.
The next step is to remove the word Four from s1. You can do this by assigning to s1 the substring of s1 that begins at 0 and is ix characters long (and so ends just before the last space):
s1 = s1.Substring(0,ix);
Reassign ix to the last (remaining) space, which points you to the beginning of the word Three, which we then extract into string s3. Continue like this until s4 and s5 are populated. Finally, print the results:
s2: Four
s3: Three
s4: Two
s5: One

s1: One
This isn’t elegant, but it works and it illustrates the use of Substring. This is not unlike using pointer arithmetic in C++, but without the pointers and unsafe code.
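Because the second argument of the two-argument Substring( ) is a character count rather than an end position, code that thinks in terms of start and end indexes has to convert between the two. The sketch below shows one way to do that; the class name and the Between helper are hypothetical.

```csharp
using System;

public static class SubstringDemo
{
    // Hypothetical helper: extracts the text from index start up to,
    // but not including, index end by converting the end index to a length.
    public static string Between(string text, int start, int end)
    {
        return text.Substring(start, end - start);
    }

    public static void Main()
    {
        string s1 = "One Two Three Four";

        // Start at index 4 and take 3 characters.
        Console.WriteLine(s1.Substring(4, 3)); // Two

        // Characters at indexes 8 through 12.
        Console.WriteLine(Between(s1, 8, 13)); // Three
    }
}
```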
A more effective solution to the problem illustrated in Example 10-2 is to use the Split( ) method of String, whose job is to parse a string into substrings. To use Split( ), pass in an array of delimiters (characters that will indicate a split in the words), and the method returns an array of substrings. Example 10-3 illustrates.
Example 10-3. Using the Split() method
#region Using directives
using System;
using System.Collections.Generic;
using System.Text;
#endregion

namespace StringSplit
{
    public class StringTester
    {
        static void Main( )
        {
            // create some strings to work with
            string s1 = "One,Two,Three Liberty Associates, Inc.";

            // constants for the space and comma characters
            const char Space = ' ';
            const char Comma = ',';

            // array of delimiters to split the sentence with
            char[] delimiters = new char[] { Space, Comma };

            string output = "";
            int ctr = 1;

            // split the string and then iterate over the
            // resulting array of strings
            foreach ( string subString in s1.Split( delimiters ) )
            {
                output += ctr++;
                output += ": ";
                output += subString;
                output += "\n";
            }
            Console.WriteLine( output );
        }
    }
}

Output:
1: One
2: Two
3: Three
4: Liberty
5: Associates
6:
7: Inc.
You start by creating a string to parse:
string s1 = "One,Two,Three Liberty Associates, Inc.";
The delimiters are set to the space and comma characters. You then call Split( ) on this string, and pass the results to the foreach loop:
foreach (string subString in s1.Split(delimiters))
Because Split uses the params keyword, you can reduce your code to:
foreach (string subString in s1.Split(' ', ','))
This eliminates the declaration of the array entirely.
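Note the empty sixth entry in the output of Example 10-3: the comma and the space after Associates are adjacent delimiters, so Split( ) reports an empty substring between them. One possible way to suppress such entries, offered here as a sketch rather than as part of the example, is the overload that takes a StringSplitOptions value. The class and method names below are hypothetical.

```csharp
using System;

public static class SplitDemo
{
    // Splits on spaces and commas, discarding empty substrings that
    // arise when two delimiters are adjacent.
    public static string[] Words(string text)
    {
        char[] delimiters = { ' ', ',' };
        return text.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string s1 = "One,Two,Three Liberty Associates, Inc.";
        int ctr = 1;
        foreach (string word in Words(s1))
        {
            Console.WriteLine("{0}: {1}", ctr++, word);
        }
        // Prints six entries, the last being "6: Inc."
    }
}
```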
Start by initializing output to an empty string and then build up the output string in four steps. Concatenate the value of ctr. Next add the colon, then the substring returned by Split( ), then the newline. With each concatenation, a new copy of the string is made, and all four steps are repeated for each substring found by Split( ). This repeated copying of string is terribly inefficient.
The problem is that the string type is not designed for this kind of operation. What you want is to create a new string by appending a formatted string each time through the loop. The class you need is StringBuilder.
The System.Text.StringBuilder class is used for creating and modifying strings. The important members of StringBuilder are summarized in Table 10-2.
Table 10-2. StringBuilder methods
Method | Explanation
---|---
Chars | The indexer.
Length | Retrieves or assigns the length of the StringBuilder.
Append() | Overloaded public method that appends a string of characters to the end of the current StringBuilder.
AppendFormat() | Overloaded public method that replaces format specifiers with the formatted value of an object.
Insert() | Overloaded public method that inserts a string of characters at the specified position.
Remove() | Removes the specified characters.
Replace() | Overloaded public method that replaces all instances of specified characters with new characters.
Unlike String, StringBuilder is mutable; when you modify a StringBuilder, you modify the actual string, not a copy. Example 10-4 replaces the String object in Example 10-3 with a StringBuilder object.
Example 10-4. Using a StringBuilder
#region Using directives
using System;
using System.Collections.Generic;
using System.Text;
#endregion

namespace UsingStringBuilder
{
    public class StringTester
    {
        static void Main( )
        {
            // create some strings to work with
            string s1 = "One,Two,Three Liberty Associates, Inc.";

            // constants for the space and comma characters
            const char Space = ' ';
            const char Comma = ',';

            // array of delimiters to split the sentence with
            char[] delimiters = new char[] { Space, Comma };

            // use a StringBuilder class to build the output string
            StringBuilder output = new StringBuilder( );
            int ctr = 1;

            // split the string and then iterate over the
            // resulting array of strings
            foreach ( string subString in s1.Split( delimiters ) )
            {
                // AppendFormat appends a formatted string
                output.AppendFormat( "{0}: {1}\n", ctr++, subString );
            }
            Console.WriteLine( output );
        }
    }
}
Only the last part of the program is modified. Instead of using the concatenation operator to modify the string, use the AppendFormat( ) method of StringBuilder to append new, formatted strings as you create them. This is more efficient. The output is identical to that of Example 10-3:
1: One
2: Two
3: Three
4: Liberty
5: Associates
6:
7: Inc.
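The other StringBuilder members from Table 10-2 modify the same buffer in place as well. The following minimal sketch (the class and method names are hypothetical) chains Append( ), Insert( ), Replace( ), and Remove( ), with the buffer contents after each call shown in the comments.

```csharp
using System;
using System.Text;

public static class BuilderDemo
{
    public static string Build()
    {
        StringBuilder sb = new StringBuilder("One Two");

        sb.Append(" Four");        // One Two Four
        sb.Insert(8, "Three ");    // One Two Three Four
        sb.Replace("One", "Zero"); // Zero Two Three Four
        sb.Remove(0, 5);           // Two Three Four

        // Only this final call creates a String object.
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Build()); // Two Three Four
    }
}
```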
Regular expressions are a powerful language for describing and manipulating text. A regular expression is applied to a string—that is, to a set of characters. Often that string is an entire text document.
Applying a regular expression to a string has one of three results: it tells you whether the string matches the regular expression, it returns a substring, or it returns a new string representing a modification of some part of the original string. (Remember that strings are immutable and so can’t be changed by the regular expression.)
By applying a properly constructed regular expression to the following string:
One,Two,Three Liberty Associates, Inc.
you can return any or all of its substrings (e.g., Liberty or One), or modified versions of its substrings (e.g., LIBeRtY or OnE). What the regular expression does is determined by the syntax of the regular expression itself.
A regular expression consists of two types of characters: literals and metacharacters. A literal is a character you wish to match in the target string. A metacharacter is a special symbol that acts as a command to the regular expression parser. The parser is the engine responsible for understanding the regular expression. For example, if you create a regular expression:
^(From|To|Subject|Date):
this will match any substring with the letters "From," "To," "Subject," or "Date," so long as those letters start a new line (^) and end with a colon (:).
The caret (^) in this case indicates to the regular expression parser that the string you’re searching for must begin a new line. The letters "From" and "To" are literals, and the metacharacters left and right parentheses ( (, ) ) and vertical bar (|) are all used to group sets of literals and indicate that any of the choices should match. (Note that ^ is a metacharacter as well, used to indicate the start of the line.)
Thus, you would read this line:
^(From|To|Subject|Date):
as follows: “Match any string that begins a new line followed by any of the four literal strings From, To, Subject, or Date followed by a colon.”
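Here is a minimal sketch of this pattern at work, using the Regex class that is introduced later in this chapter. The class name, the helper method, and the sample message are made up for illustration. One detail worth knowing: in .NET the caret matches only at the very start of the input unless you pass RegexOptions.Multiline, which makes it match at the start of every line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeaderDemo
{
    // RegexOptions.Multiline lets ^ match at the start of each line,
    // not just at the start of the whole input.
    private static readonly Regex Header =
        new Regex(@"^(From|To|Subject|Date):", RegexOptions.Multiline);

    public static int CountHeaders(string text)
    {
        return Header.Matches(text).Count;
    }

    public static void Main()
    {
        string message =
            "From: alice@example.com\n" +
            "To: bob@example.com\n" +
            "See Date: below\n" +   // "Date:" is not at the start of the line
            "Subject: Hello";

        Console.WriteLine(CountHeaders(message)); // 3
    }
}
```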
A full explanation of regular expressions is beyond the scope of this book, but all the regular expressions used in the examples are explained. For a complete understanding of regular expressions, I highly recommend Mastering Regular Expressions (O’Reilly).
The .NET Framework provides an object-oriented approach to regular expression matching and replacement.
C#’s regular expressions are based on Perl 5 regexp, including lazy quantifiers (??, *?, +?, {n,m}?), positive and negative look ahead, and conditional evaluation.
The namespace System.Text.RegularExpressions is the home to all the .NET Framework objects associated with regular expressions. The central class for regular expression support is Regex, which represents an immutable, compiled regular expression. Although instances of Regex can be created, the class also provides a number of useful static methods. The use of Regex is illustrated in Example 10-5.
Example 10-5. Using the Regex class for regular expressions
#region Using directives
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
#endregion

namespace UsingRegEx
{
    public class Tester
    {
        static void Main( )
        {
            string s1 = "One,Two,Three Liberty Associates, Inc.";
            Regex theRegex = new Regex( " |, |," );
            StringBuilder sBuilder = new StringBuilder( );
            int id = 1;

            foreach ( string subString in theRegex.Split( s1 ) )
            {
                sBuilder.AppendFormat( "{0}: {1}\n", id++, subString );
            }
            Console.WriteLine( "{0}", sBuilder );
        }
    }
}

Output:
1: One
2: Two
3: Three
4: Liberty
5: Associates
6: Inc.
Example 10-5 begins by creating a string, s1, that is identical to the string used in Example 10-4:
string s1 = "One,Two,Three Liberty Associates, Inc.";
It also creates a regular expression, which will be used to search that string:
Regex theRegex = new Regex(" |, |,");
One of the overloaded constructors for Regex takes a regular expression string as its parameter. This is a bit confusing. In the context of a C# program, which is the regular expression? Is it the text passed in to the constructor, or the Regex object itself? It is true that the text string passed to the constructor is a regular expression in the traditional sense of the term. From an object-oriented C# point of view, however, the argument to the constructor is just a string of characters; it is theRegex that is the regular expression object.
The rest of the program proceeds like the earlier Example 10-4, except that instead of calling Split( ) on string s1, the Split( ) method of Regex is called. Regex.Split( ) acts in much the same way as String.Split( ), returning an array of strings as a result of matching the regular expression pattern within theRegex.
Regex.Split( ) is overloaded. The simplest version is called on an instance of Regex, as shown in Example 10-5. There is also a static version of this method, which takes a string to search and the pattern to search with, as illustrated in Example 10-6.
Example 10-6. Using static Regex.Split()
#region Using directives
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
#endregion

namespace RegExSplit
{
    public class Tester
    {
        static void Main( )
        {
            string s1 = "One,Two,Three Liberty Associates, Inc.";
            StringBuilder sBuilder = new StringBuilder( );
            int id = 1;

            foreach ( string subStr in Regex.Split( s1, " |, |," ) )
            {
                sBuilder.AppendFormat( "{0}: {1}\n", id++, subStr );
            }
            Console.WriteLine( "{0}", sBuilder );
        }
    }
}
Example 10-6 is identical to Example 10-5, except that the latter example doesn’t instantiate an object of type Regex. Instead, Example 10-6 uses the static version of Split( ), which takes two arguments: a string to search and a regular expression string that represents the pattern to match.
The instance method of Split( ) is also overloaded with versions that limit the number of times the split will occur and also determine the position within the target string where the search will begin.
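As a sketch of the count-limited overload, the following code asks for at most three substrings, so that everything after the second delimiter is returned unsplit as the final element. The class and method names are hypothetical; the pattern is the one used in Example 10-5.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LimitedSplitDemo
{
    // Splits the input into at most maxParts substrings; the last
    // element holds whatever remains of the input.
    public static string[] SplitAtMost(string input, int maxParts)
    {
        Regex theRegex = new Regex(" |, |,");
        return theRegex.Split(input, maxParts);
    }

    public static void Main()
    {
        string s1 = "One,Two,Three Liberty Associates, Inc.";
        foreach (string part in SplitAtMost(s1, 3))
        {
            Console.WriteLine(part);
        }
        // One
        // Two
        // Three Liberty Associates, Inc.
    }
}
```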
Two additional classes in the .NET RegularExpressions namespace allow you to search a string repeatedly, and to return the results in a collection. The collection returned is of type MatchCollection, which consists of zero or more Match objects. Two important properties of a Match object are its length and its value, each of which can be read as illustrated in Example 10-7.
Example 10-7. Using MatchCollection and Match
#region Using directives

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

#endregion

namespace UsingMatchCollection
{
    class Test
    {
        public static void Main( )
        {
            string string1 = "This is a test string";

            // find any nonwhitespace followed by whitespace
            Regex theReg = new Regex( @"(\S+)\s" );

            // get the collection of matches
            MatchCollection theMatches = theReg.Matches( string1 );

            // iterate through the collection
            foreach ( Match theMatch in theMatches )
            {
                Console.WriteLine( "theMatch.Length: {0}", theMatch.Length );
                if ( theMatch.Length != 0 )
                {
                    Console.WriteLine( "theMatch: {0}", theMatch.ToString( ) );
                }
            }
        }
    }
}

Output:
theMatch.Length: 5
theMatch: This
theMatch.Length: 3
theMatch: is
theMatch.Length: 2
theMatch: a
theMatch.Length: 5
theMatch: test
Example 10-7 creates a simple string to search:
string string1 = "This is a test string";
and a trivial regular expression to search it:
Regex theReg = new Regex(@"(\S+)\s");
The string \S finds nonwhitespace, and the plus sign indicates one or more. The string \s (note lowercase) indicates whitespace. Thus, together, this string looks for any nonwhitespace characters followed by whitespace.
Remember the at (@) symbol before the string creates a verbatim string, which avoids having to escape the backslash (\) character.
The output shows that the first four words were found. The final word
wasn’t found because it isn’t
followed by a space. If you insert a space after the word
string
and before the closing quotation marks,
this program finds that word as well.
The Length property is the length of the captured substring, and is discussed in the section “Using CaptureCollection,” later in this chapter.
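As an alternative to editing the input string, the pattern itself can accept the end of the string in place of trailing whitespace. The sketch below is an illustration only: the class FinalWordSketch and the method CapturedWords are hypothetical names, and the second pattern, (\S+)(\s|$), is an assumption about how you might want to relax the original expression.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only for illustration.
public static class FinalWordSketch
{
    // Returns the text captured by the first parenthesized group
    // for every match of the pattern in the input.
    public static List<string> CapturedWords(string input, string pattern)
    {
        List<string> words = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            words.Add(m.Groups[1].Value);
        }
        return words;
    }

    public static void Main()
    {
        string string1 = "This is a test string";

        // Pattern from Example 10-7: the final word is missed.
        Console.WriteLine(string.Join("|", CapturedWords(string1, @"(\S+)\s")));

        // Relaxed pattern: whitespace OR end of string may follow the word.
        Console.WriteLine(string.Join("|", CapturedWords(string1, @"(\S+)(\s|$)")));
    }
}
```

Reading Groups[1] rather than the whole match also drops the trailing whitespace character that is otherwise counted in the match's Length.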
It is often convenient to group subexpression matches together so that you can parse out pieces of the matching string. For example, you might want to match on IP addresses and group all IP addresses found anywhere within the string.
IP addresses are used to locate computers on a network, and typically have the form x.x.x.x, where x is generally any number between 0 and 255 (such as 192.168.0.1).
The Group
class allows you to create groups of
matches based on regular expression syntax, and represents the
results from a single grouping expression.
A grouping expression names a group and provides a regular
expression; any substring matching the regular expression will be
added to the group. For example, to create an ip
group you might write:
@"(?<ip>(\d|\.)+)\s"
The Match
class derives from Group
, and has a collection
called Groups
that contains all the groups your
Match
finds.
Creation and use of the Groups
collection and
Group
classes are illustrated in Example 10-8.
Example 10-8. Using the Group class
#region Using directives

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

#endregion

namespace RegExGroup
{
    class Test
    {
        public static void Main( )
        {
            string string1 = "04:03:27 127.0.0.0 LibertyAssociates.com";

            // group time = one or more digits or colons followed by space
            Regex theReg = new Regex( @"(?<time>(\d|:)+)\s" +
            // ip address = one or more digits or dots followed by space
            @"(?<ip>(\d|\.)+)\s" +
            // site = one or more characters
            @"(?<site>\S+)" );

            // get the collection of matches
            MatchCollection theMatches = theReg.Matches( string1 );

            // iterate through the collection
            foreach ( Match theMatch in theMatches )
            {
                if ( theMatch.Length != 0 )
                {
                    Console.WriteLine( " theMatch: {0}", theMatch.ToString( ) );
                    Console.WriteLine( "time: {0}", theMatch.Groups["time"] );
                    Console.WriteLine( "ip: {0}", theMatch.Groups["ip"] );
                    Console.WriteLine( "site: {0}", theMatch.Groups["site"] );
                }
            }
        }
    }
}
Again, Example 10-8 begins by creating a string to search:
string string1 = "04:03:27 127.0.0.0 LibertyAssociates.com";
This string might be one of many recorded in a web server log file or produced as the result of a search of the database. In this simple example, there are three columns: one for the time of the log entry, one for an IP address, and one for the site, each separated by spaces. Of course, in an example solving a real-life problem, you might need to do more complex queries and choose to use other delimiters and more complex searches.
In Example 10-8, we want to create a single
Regex
object to search strings of this type and
break them into three groups: time
,
ip
address
, and
site
. The regular expression string is fairly
simple, so the example is easy to understand. However, keep in mind
that in a real search, you would probably use only a part of the
source string rather than the entire source string, as shown here.
// group time = one or more digits or colons
// followed by space
Regex theReg = new Regex(@"(?<time>(\d|:)+)\s" +
// ip address = one or more digits or dots
// followed by space
@"(?<ip>(\d|\.)+)\s" +
// site = one or more characters
@"(?<site>\S+)");
Let’s focus on the characters that create the group:
(?<time>(\d|:)+)
The parentheses create a group.
Everything between the opening parenthesis (just before the question
mark) and the closing parenthesis (in this case, after the
+
sign) is a single unnamed group.
The string ?<time> names that group time, and the group is associated with the matching text, which is the regular expression (\d|:)+\s. This regular expression can be interpreted as “one or more digits or colons followed by a space.”
Similarly, the string ?<ip>
names the
ip
group, and ?<site>
names the site
group. As Example 10-7 does, Example 10-8 asks for a
collection of all the matches:
MatchCollection theMatches = theReg.Matches(string1);
Example 10-8 iterates through the
Matches
collection, finding each
Match
object.
If the Length
of the Match
is
greater than 0
, a Match
was
found; it prints the entire match:
Console.WriteLine(" theMatch: {0}", theMatch.ToString());
Here’s the output:
theMatch: 04:03:27 127.0.0.0 LibertyAssociates.com
It then gets the time group from the theMatch.Groups collection and prints that value:
Console.WriteLine("time: {0}", theMatch.Groups["time"]);
This produces the output:
time: 04:03:27
The code then obtains ip
and
site
groups:
Console.WriteLine("ip: {0}", theMatch.Groups["ip"]);
Console.WriteLine("site: {0}", theMatch.Groups["site"]);
This produces the output:
ip: 127.0.0.0
site: LibertyAssociates.com
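The group lookups walked through above can be wrapped in a small helper that reports whether the line matched at all. This is a hedged sketch, not part of Example 10-8: the class LogLineSketch and the method TryParseLogLine are hypothetical names, and it reads each group through its Value property instead of relying on the implicit ToString() call made by Console.WriteLine.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only for illustration.
public static class LogLineSketch
{
    // Same three named groups as Example 10-8.
    private static readonly Regex LogLine = new Regex(
        @"(?<time>(\d|:)+)\s" +
        @"(?<ip>(\d|\.)+)\s" +
        @"(?<site>\S+)");

    // Returns true and fills the out parameters when the line matches;
    // returns false and sets them to null otherwise.
    public static bool TryParseLogLine(
        string line, out string time, out string ip, out string site)
    {
        Match m = LogLine.Match(line);
        if (!m.Success)
        {
            time = null;
            ip = null;
            site = null;
            return false;
        }

        time = m.Groups["time"].Value;
        ip = m.Groups["ip"].Value;
        site = m.Groups["site"].Value;
        return true;
    }

    public static void Main()
    {
        string time, ip, site;
        if (TryParseLogLine("04:03:27 127.0.0.0 LibertyAssociates.com",
                            out time, out ip, out site))
        {
            Console.WriteLine("time: {0}", time);
            Console.WriteLine("ip: {0}", ip);
            Console.WriteLine("site: {0}", site);
        }
    }
}
```

Checking Success on a single Match is one way to distinguish "no match" from "matched"; the chapter's example makes the same distinction by testing the match's Length.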
In Example 10-8, the Matches collection has only one Match. It is possible, however, to match more than one expression within a string. To see this, modify string1 in Example 10-8 to provide several log file entries instead of one, as follows:
string string1 = "04:03:27 127.0.0.0 LibertyAssociates.com " +
                 "04:03:28 127.0.0.0 foo.com " +
                 "04:03:29 127.0.0.0 bar.com " ;
This creates three matches in the MatchCollection
,
called theMatches
. Here’s the
resulting output:
theMatch: 04:03:27 127.0.0.0 LibertyAssociates.com
time: 04:03:27
ip: 127.0.0.0
site: LibertyAssociates.com
theMatch: 04:03:28 127.0.0.0 foo.com
time: 04:03:28
ip: 127.0.0.0
site: foo.com
theMatch: 04:03:29 127.0.0.0 bar.com
time: 04:03:29
ip: 127.0.0.0
site: bar.com
In this example, theMatches
contains three
Match
objects. Each time through the outer
foreach
loop we find the next
Match
in the collection and display its contents:
foreach (Match theMatch in theMatches)
For each Match
item found, you can print out the
entire match, various groups, or both.
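Iterating a MatchCollection with foreach is one way to visit every match; another is to ask each Match for its successor. The following sketch assumes the three-entry log string shown above and uses hypothetical names (MultiMatchSketch, Sites); Match.NextMatch() and Match.Success are the members doing the work.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only for illustration.
public static class MultiMatchSketch
{
    // Same three named groups as Example 10-8.
    private static readonly Regex LogEntry = new Regex(
        @"(?<time>(\d|:)+)\s" +
        @"(?<ip>(\d|\.)+)\s" +
        @"(?<site>\S+)");

    // Collects the site group from every log entry found in the string.
    public static List<string> Sites(string log)
    {
        List<string> sites = new List<string>();

        // Start with the first match, then keep asking for the next one
        // until a match reports that it did not succeed.
        for (Match m = LogEntry.Match(log); m.Success; m = m.NextMatch())
        {
            sites.Add(m.Groups["site"].Value);
        }
        return sites;
    }

    public static void Main()
    {
        string string1 = "04:03:27 127.0.0.0 LibertyAssociates.com " +
                         "04:03:28 127.0.0.0 foo.com " +
                         "04:03:29 127.0.0.0 bar.com ";

        foreach (string site in Sites(string1))
        {
            Console.WriteLine("site: {0}", site);
        }
    }
}
```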
Each
time a Regex
object
matches a subexpression, a Capture
instance is
created and added to a CaptureCollection
collection. Each Capture
object represents a
single capture. Each group has its own capture collection of the
matches for the subexpression associated with the group.
A key property of the Capture object is its Length, which is the length of the captured substring. When you ask Match for its length, it is Capture.Length that you retrieve because Match derives from Group, which in turn derives from Capture.
The regular expression
inheritance
scheme in .NET allows Match
to include in its
interface the methods and properties of these parent classes. In a
sense, a Group
is-a capture:
it is a capture that encapsulates the idea of grouping
subexpressions. A Match
, in turn,
is-a
Group
: it is the
encapsulation of all the groups of subexpressions making up the
entire match for this regular expression. (See Chapter 5 for more about the
is-a relationship and other relationships.)
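The inheritance chain described above can be checked directly with reflection and an ordinary upcast. This is a small sketch for illustration; the class MatchHierarchySketch and the method LengthViaCapture are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only for illustration.
public static class MatchHierarchySketch
{
    // Length is declared on Capture, so a Match can be measured
    // through a reference of type Capture.
    public static int LengthViaCapture(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        Capture asCapture = m;   // implicit upcast: a Match is-a Capture
        return asCapture.Length;
    }

    public static void Main()
    {
        // Group derives from Capture, and Match derives from Group.
        Console.WriteLine(typeof(Group).IsSubclassOf(typeof(Capture)));
        Console.WriteLine(typeof(Match).IsSubclassOf(typeof(Group)));

        // First match of the Example 10-7 pattern, measured as a Capture.
        Console.WriteLine(LengthViaCapture("This is a test", @"(\S+)\s"));
    }
}
```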
Typically, you will find only a single Capture
in
a CaptureCollection
, but that need not be so.
Consider what would happen if you were parsing a string in which the
company name might occur in either of two positions. To group these
together in a single match, create the
?<company>
group in two places in your
regular expression pattern:
Regex theReg = new Regex(@"(?<time>(\d|:)+)\s" +
                         @"(?<company>\S+)\s" +
                         @"(?<ip>(\d|\.)+)\s" +
                         @"(?<company>\S+)\s");
This regular expression group captures any matching string of
characters that follows time
, and also any
matching string of characters that follows ip
.
Given this regular expression, you are ready to parse the following
string:
string string1 = "04:03:27 Jesse 0.0.0.127 Liberty ";
The string includes names in both the positions specified. Here is the result:
theMatch: 04:03:27 Jesse 0.0.0.127 Liberty
time: 04:03:27
ip: 0.0.0.127
Company: Liberty
What happened? Why is the Company
group showing
Liberty
? Where is the first term, which also
matched? The answer is that the second term overwrote the first. The
group, however, has captured both. Its Captures
collection can demonstrate, as illustrated in Example 10-9.
Example 10-9. Examining the Captures collection
#region Using directives

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

#endregion

namespace CaptureCollection
{
    class Test
    {
        public static void Main( )
        {
            // the string to parse
            // note that names appear in both
            // searchable positions
            string string1 = "04:03:27 Jesse 0.0.0.127 Liberty ";

            // regular expression which groups company twice
            Regex theReg = new Regex( @"(?<time>(\d|:)+)\s" +
                @"(?<company>\S+)\s" +
                @"(?<ip>(\d|\.)+)\s" +
                @"(?<company>\S+)\s" );

            // get the collection of matches
            MatchCollection theMatches = theReg.Matches( string1 );

            // iterate through the collection
            foreach ( Match theMatch in theMatches )
            {
                if ( theMatch.Length != 0 )
                {
                    Console.WriteLine( "theMatch: {0}", theMatch.ToString( ) );
                    Console.WriteLine( "time: {0}", theMatch.Groups["time"] );
                    Console.WriteLine( "ip: {0}", theMatch.Groups["ip"] );
                    Console.WriteLine( "Company: {0}", theMatch.Groups["company"] );

                    // iterate over the captures collection
                    // in the company group within the
                    // groups collection in the match
                    foreach ( Capture cap in theMatch.Groups["company"].Captures )
                    {
                        Console.WriteLine( "cap: {0}", cap.ToString( ) );
                    }
                }
            }
        }
    }
}

Output:
theMatch: 04:03:27 Jesse 0.0.0.127 Liberty
time: 04:03:27
ip: 0.0.0.127
Company: Liberty
cap: Jesse
cap: Liberty
The inner foreach loop iterates through the Captures collection for the company group:
foreach (Capture cap in theMatch.Groups["company"].Captures)
Let’s review how this line is parsed. The compiler
begins by finding the collection that it will iterate over.
theMatch
is an object that has a collection named
Groups
. The Groups
collection
has an indexer that takes a string and returns a single
Group
object. Thus, the following line returns a
single Group
object:
theMatch.Groups["company"]
The Group
object has a collection named
Captures
. Thus, the following line returns a
Captures
collection for the
Group
stored at
Groups["company"]
within the
theMatch
object:
theMatch.Groups["company"].Captures
The foreach
loop iterates over the
Captures
collection, extracting each element in
turn and assigning it to the local variable cap
,
which is of type Capture
. You can see from the
output that there are two capture elements: Jesse
and Liberty
. The second one overwrites the first
in the group, and so the displayed value is just
Liberty
. However, by examining the
Captures
collection, you can find both values that
were
captured.
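A group can also collect several captures without being named twice, if it sits inside a repeated subexpression. The sketch below is an assumption-laden illustration, not one of the chapter's examples: the class RepeatedCaptureSketch, its methods, and the octet group are hypothetical, and the pattern is deliberately loose (it does not check that each number is between 0 and 255).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only for illustration.
public static class RepeatedCaptureSketch
{
    // The octet group is inside a repeated subexpression, so each
    // repetition adds one Capture to the group's Captures collection.
    private static readonly Regex Address =
        new Regex(@"^((?<octet>\d+)\.?)+$");

    // Every capture recorded for the octet group, in order.
    public static List<string> Octets(string address)
    {
        List<string> result = new List<string>();
        Match m = Address.Match(address);
        if (m.Success)
        {
            foreach (Capture cap in m.Groups["octet"].Captures)
            {
                result.Add(cap.Value);
            }
        }
        return result;
    }

    // The group's own value reflects only the last capture.
    public static string GroupValue(string address)
    {
        return Address.Match(address).Groups["octet"].Value;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", Octets("192.168.0.1")));
        Console.WriteLine(GroupValue("192.168.0.1"));
    }
}
```

As with the company group, the group's value shows only the final capture, while the Captures collection preserves all of them.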
[1] Ordering the string is one of a number of lexical operations that act on the value of the string and take into account culture-specific information based on the explicitly declared culture or the implicit current culture. Therefore, if the current culture is U.S. English (as is assumed throughout this book), the Compare method considers 'a' less than 'A'. CompareOrdinal performs an ordinal comparison, and thus regardless of culture, 'a' is greater than 'A'.