Chapter 10. Strings and Regular Expressions

There was a time when people thought of computers exclusively as manipulating numeric values. Early computers were first used to calculate missile trajectories (though recently declassified documents suggest that some were used for code-breaking as well). In any case, there was a time when programming was taught in the math departments of major universities, and computer science was considered a discipline of mathematics.

Today, most programs are concerned more with strings of characters than with strings of numbers. Typically, these strings are used for word processing, document manipulation, and creation of web pages.

C# provides built-in support for a fully functional string type. More important, C# treats strings as objects that encapsulate all the manipulation, sorting, and searching methods normally applied to strings of characters.

Tip

C programmers take note: in C#, string is a first-class type, not an array of characters.

Complex string manipulation and pattern-matching are aided by the use of regular expressions. C# combines the power and complexity of regular expression syntax, originally found only in string manipulation languages such as awk and Perl, with a fully object-oriented design.

In this chapter, you will learn to work with the C# string type and the .NET Framework System.String class that it aliases. You will see how to extract substrings, manipulate and concatenate strings, and build new strings with the StringBuilder class. In addition, you will learn how to use the Regex class to match strings based on complex regular expressions.

Strings

C# treats strings as first-class types that are flexible, powerful, and easy to use.

Tip

In C# programming, you typically use the C# alias for a Framework type (e.g., int for Int32), but you are always free to use the underlying type. C# programmers thus use string (lowercase) and the underlying Framework type String (uppercase) interchangeably.

The declaration of the String class is:

public sealed class String :
  IComparable, IComparable<String>, ICloneable, IConvertible,
  IEnumerable, IEnumerable<char>, IEquatable<String>

This declaration reveals that the class is sealed, meaning that it is not possible to derive from the String class. The class also implements seven system interfaces—IComparable, IComparable<String>, ICloneable, IConvertible, IEnumerable, IEnumerable<char>, and IEquatable<String>—that dictate functionality that String shares with other classes in the .NET Framework.

Tip

Each string object is an immutable sequence of Unicode characters. The fact that String is immutable means that methods that appear to change the string actually return a modified copy; the original string remains intact in memory until it is garbage-collected. This may have performance implications; if you plan to do significant repeated string manipulation, use a StringBuilder (described later).

As explained in Chapter 9, the IComparable<String> interface is implemented by types whose values can be ordered. Strings, for example, can be alphabetized; any given string can be compared with another string to determine which should come first in an ordered list.[14] IComparable classes implement the CompareTo method. IEnumerable, also discussed in Chapter 9, lets you use the foreach construct to enumerate a string as a collection of chars.
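As a minimal sketch of these two interfaces in use, the following compares two strings with CompareTo( ) and enumerates a string with foreach. The class name InterfaceDemo and the helper CountUpper are illustrative names invented for this sketch, not part of the Framework:

```csharp
using System;

public static class InterfaceDemo
{
    // Counts uppercase letters by enumerating the string as a sequence of chars.
    // (CountUpper is a hypothetical helper written for this illustration.)
    public static int CountUpper(string text)
    {
        int count = 0;
        foreach (char c in text)
        {
            if (char.IsUpper(c))
            {
                count++;
            }
        }
        return count;
    }

    public static void Main()
    {
        // CompareTo returns a negative number, zero, or a positive number.
        int order = "apple".CompareTo("banana");
        Console.WriteLine(order < 0);                   // True: "apple" sorts first
        Console.WriteLine(CountUpper("One Two Three")); // 3
    }
}
```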

ICloneable objects can create new instances with the same value as the original instance. In this case, it is possible to clone a string to produce a new string with the same values (characters) as the original. ICloneable classes implement the Clone( ) method.

Tip

Actually, because strings are immutable, the Clone( ) method on String just returns a reference to the original string.

If you use that reference to make a change, a new string is created and the reference created by Clone( ) now points to the new (changed) string:

string s1 = "One Two Three Four";
string sx = (string)s1.Clone(  );
Console.WriteLine(
   Object.ReferenceEquals(s1,sx));
sx += " Five";
Console.WriteLine(
   Object.ReferenceEquals(s1, sx));
Console.WriteLine(sx);

In this case, sx is created as a clone of s1. The first WriteLine statement prints True; the two string variables refer to the same string in memory. When you change sx, you actually create a new string from the first, so the second call to ReferenceEquals returns False. The final WriteLine statement prints the contents of the original string with the word Five appended.

IConvertible classes provide methods such as ToInt32( ), ToDouble( ), ToDecimal( ), and so on, that facilitate conversion to other primitive types.
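The following sketch shows one way to reach these conversion methods on a string. It assumes that String implements IConvertible explicitly, so the call goes through an IConvertible reference; the static Convert class is shown alongside as the more common route. ConversionDemo and ParseViaInterface are names made up for this illustration:

```csharp
using System;
using System.Globalization;

public static class ConversionDemo
{
    // Converts a string of digits to an int through the IConvertible interface.
    // (ParseViaInterface is a hypothetical helper, not a Framework method.)
    public static int ParseViaInterface(string digits)
    {
        IConvertible convertible = digits;
        return convertible.ToInt32(CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        Console.WriteLine(ParseViaInterface("42") + 1);                             // 43
        Console.WriteLine(Convert.ToInt32("42", CultureInfo.InvariantCulture) + 1); // 43
    }
}
```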

Creating Strings

The most common way to create a string is to assign a quoted string of characters, known as a string literal, to a user-defined variable of type string:

string newString = "This is a string literal";

Quoted strings can include escape characters, such as \n or \t, which begin with a backslash character (\). The two shown are used to indicate where line breaks or tabs are to appear, respectively.

Tip

Because the backslash is the escape character, if you want to put a backslash into a string (e.g., to create a path listing), you must quote the backslash with a second backslash (\\).

Strings can also be created using verbatim string literals, which start with the at (@) symbol. This tells the compiler that the string should be used verbatim, even if it spans multiple lines or includes escape characters. In a verbatim string literal, backslashes and the characters that follow them are simply considered additional characters of the string. Thus, the following two definitions are equivalent:

string literalOne = "\\\\MySystem\\MyDirectory\\ProgrammingC#.cs";
string verbatimLiteralOne = @"\\MySystem\MyDirectory\ProgrammingC#.cs";

In the first line, a nonverbatim string literal is used, and so the backslash character (\) must be escaped. This means it must be preceded by a second backslash character. In the second line, a verbatim literal string is used, so the extra backslash is not needed. A second example illustrates multiline verbatim strings:

string literalTwo = "Line One\nLine Two";
string verbatimLiteralTwo = @"Line One
Line Two";

Tip

If you have double quotes within a verbatim string, you must escape them (with double-double quotes) so that the compiler knows when the verbatim string ends. For example:

String verbatim = @"This is a ""verbatim"" string";

when written to the console, will produce the output:

This is a "verbatim" string

Again, these declarations are interchangeable. Which one you use is a matter of convenience and personal style.
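A quick way to convince yourself that the two literal forms produce the same string is to compare them at runtime. The sketch below does that; the path and the names LiteralDemo, EscapedPath, and VerbatimPath are invented for this illustration:

```csharp
using System;

public static class LiteralDemo
{
    // The path below is a made-up example, not one used elsewhere in the chapter.
    public static string EscapedPath()  { return "C:\\Temp\\notes.txt"; }
    public static string VerbatimPath() { return @"C:\Temp\notes.txt"; }

    public static void Main()
    {
        // Both literals describe the same 17 characters.
        Console.WriteLine(EscapedPath() == VerbatimPath()); // True
        Console.WriteLine(VerbatimPath().Length);           // 17

        // Doubled quotes in a verbatim literal equal an escaped quote in a regular one.
        Console.WriteLine(@"say ""hi""" == "say \"hi\"");   // True
    }
}
```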

The ToString( ) Method

Another common way to create a string is to call the ToString( ) method on an object and assign the result to a string variable. All the built-in types override this method to simplify the task of converting a value (often a numeric value) to a string representation of that value. In the following example, the ToString( ) method of an integer type is called to store its value in a string:

int myInteger = 5;
string integerString = myInteger.ToString(  );

The call to myInteger.ToString( ) returns a String object, which is then assigned to integerString.
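The numeric types also offer ToString( ) overloads that accept a format string. As a small sketch (ToStringDemo, Padded, and TwoDecimals are hypothetical names, and the invariant culture is passed only to keep the output predictable across machines):

```csharp
using System;
using System.Globalization;

public static class ToStringDemo
{
    // "D3" pads an integer with leading zeros to at least three digits.
    public static string Padded(int value)
    {
        return value.ToString("D3", CultureInfo.InvariantCulture);
    }

    // "F2" formats a number with exactly two digits after the decimal point.
    public static string TwoDecimals(double value)
    {
        return value.ToString("F2", CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        Console.WriteLine(Padded(5));            // 005
        Console.WriteLine(TwoDecimals(3.14159)); // 3.14
    }
}
```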

The .NET String class provides a wealth of overloaded constructors that support a variety of techniques for assigning string values to string types. Some of these constructors enable you to create a string by passing in a character array or character pointer. Passing in a character array as a parameter to the constructor of the String creates a CLR-compliant new instance of a string. Passing in a character pointer requires the unsafe marker, as explained in Chapter 23.
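The character-array constructors mentioned above can be sketched as follows. Only safe (non-pointer) constructors are shown, and ConstructorDemo and FromRange are names made up for the example:

```csharp
using System;

public static class ConstructorDemo
{
    // Builds a string from part of a character array.
    // (FromRange is a hypothetical helper wrapping the three-argument constructor.)
    public static string FromRange(char[] chars, int start, int length)
    {
        return new string(chars, start, length);
    }

    public static void Main()
    {
        char[] letters = { 'a', 'b', 'c', 'd' };

        Console.WriteLine(new string(letters));      // abcd
        Console.WriteLine(FromRange(letters, 1, 2)); // bc
        Console.WriteLine(new string('-', 5));       // -----
    }
}
```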

Manipulating Strings

The string class provides a host of methods for comparing, searching, and manipulating strings, the most important of which appear in Table 10-1.

Table 10-1. Methods and fields for the string class

Method or field   Purpose
Chars             The string indexer
Compare( )        Overloaded public static method that compares two strings
CompareTo( )      Compares this string with another
Concat( )         Overloaded public static method that creates a new string from one or more strings
Copy( )           Public static method that creates a new string by copying another
CopyTo( )         Copies the specified number of characters to an array of Unicode characters
Empty             Public static field that represents the empty string
EndsWith( )       Indicates whether the specified string matches the end of this string
Equals( )         Overloaded public static and instance method that determines whether two strings have the same value
Format( )         Overloaded public static method that formats a string using a format specification
Join( )           Overloaded public static method that concatenates a specified string between each element of a string array
Length            The number of characters in the instance
Split( )          Returns the substrings delimited by the specified characters in a string array
StartsWith( )     Indicates whether the string starts with the specified characters
Substring( )      Retrieves a substring
ToUpper( )        Returns a copy of the string in uppercase
Trim( )           Removes all occurrences of a set of specified characters from the beginning and end of the string
TrimEnd( )        Behaves like Trim( ), but only at the end
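Several of the members in Table 10-1 are not exercised by the longer examples in this chapter. The following sketch touches a few of them (Split( ), Trim( ), Join( ), StartsWith( ), TrimEnd( ), and Format( )). TableMethodsDemo and Normalize are hypothetical names chosen for the illustration:

```csharp
using System;

public static class TableMethodsDemo
{
    // Splits on commas, trims each piece, and joins the pieces with "; ".
    public static string Normalize(string csv)
    {
        string[] parts = csv.Split(',');
        for (int i = 0; i < parts.Length; i++)
        {
            parts[i] = parts[i].Trim();
        }
        return string.Join("; ", parts);
    }

    public static void Main()
    {
        Console.WriteLine(Normalize(" One , Two,Three "));   // One; Two; Three

        // An ordinal comparison avoids any culture-specific matching rules.
        Console.WriteLine("Liberty".StartsWith("Lib", StringComparison.Ordinal)); // True

        // TrimEnd removes trailing whitespace only; the leading spaces remain.
        Console.WriteLine("  padded  ".TrimEnd() + "|");

        Console.WriteLine(string.Format("{0} has {1} characters", "abcd", "abcd".Length));
    }
}
```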

Example 10-1 illustrates the use of some of these methods, including Compare( ), Concat( ) (and the overloaded + operator), Copy( ) (and the = operator), Insert( ), EndsWith( ), and IndexOf( ).

Example 10-1. Working with strings
using System;
using System.Collections.Generic;
using System.Text;

namespace WorkingWithStrings
{
    public static class StringTester
    {
        static void Main(  )
        {
            // create some strings to work with
            string s1 = "abcd";
            string s2 = "ABCD";
            string s3 = @"Liberty Associates, Inc.
               provides custom .NET development,
               on-site Training and Consulting";


            int result; // hold the results of comparisons

            // compare two strings, case sensitive
            result = string.Compare(s1, s2);
            Console.WriteLine(
            "compare s1: {0}, s2: {1}, result: {2}\n", s1, s2, result);

            // overloaded compare, takes boolean "ignore case"
            //(true = ignore case)
            result = string.Compare(s1, s2, true);
            Console.WriteLine("compare insensitive\n");
            Console.WriteLine("s4: {0}, s2: {1}, result: {2}\n", s1, s2, result);

            // concatenation method
            string s6 = string.Concat(s1, s2);
            Console.WriteLine("s6 concatenated from s1 and s2: {0}", s6);

            // use the overloaded operator
            string s7 = s1 + s2;
            Console.WriteLine("s7 concatenated from s1 + s2: {0}", s7);

            // the string copy method
            string s8 = string.Copy(s7);
            Console.WriteLine("s8 copied from s7: {0}", s8);

            // use the assignment operator
            string s9 = s8;
            Console.WriteLine("s9 = s8: {0}", s9);

            // three ways to compare.
            Console.WriteLine(
            "\nDoes s9.Equals(s8)?: {0}", s9.Equals(s8));
            Console.WriteLine("Does Equals(s9,s8)?: {0}", string.Equals(s9, s8));
            Console.WriteLine("Does s9==s8?: {0}", s9 == s8);

            // Two useful properties: the index and the length
            Console.WriteLine("\nString s9 is {0} characters long. ", s9.Length);
            Console.WriteLine("The 5th character is {0}\n", s9[4]);

            // test whether a string ends with a set of characters
            Console.WriteLine("s3:{0}\nEnds with Training?: {1}\n", s3,
                s3.EndsWith("Training"));
            Console.WriteLine("Ends with Consulting?: {0}",
                s3.EndsWith("Consulting"));

            // return the index of the substring
            Console.WriteLine("\nThe first occurrence of Training ");
            Console.WriteLine("in s3 is {0}\n", s3.IndexOf("Training"));

            // insert the word "excellent" before "training"
            string s10 = s3.Insert(101, "excellent ");
            Console.WriteLine("s10: {0}\n", s10);

            // you can combine the two as follows:
            string s11 = s3.Insert(s3.IndexOf("Training"), "excellent ");
            Console.WriteLine("s11: {0}\n", s11);
        }
    }
}


Output:
compare s1: abcd, s2: ABCD, result: -1

compare insensitive

s4: abcd, s2: ABCD, result: 0

s6 concatenated from s1 and s2: abcdABCD
s7 concatenated from s1 + s2: abcdABCD
s8 copied from s7: abcdABCD
s9 = s8: abcdABCD

Does s9.Equals(s8)?: True
Does Equals(s9,s8)?: True
Does s9==s8?: True

String s9 is 8 characters long.
The 5th character is A

s3:Liberty Associates, Inc.
 provides custom .NET development,
 on-site Training and Consulting
Ends with Training?: False

Ends with Consulting?: True

The first occurrence of Training
in s3 is 101

s10: Liberty Associates, Inc.
               provides custom .NET development,
               on-site excellent Training and Consulting

s11: Liberty Associates, Inc.
               provides custom .NET development,
               on-site excellent Training and Consulting

Example 10-1 begins by declaring three strings:

string s1 = "abcd";
string s2 = "ABCD";
string s3 = @"Liberty Associates, Inc.
 provides custom .NET development,
 on-site Training and Consulting";

The first two are string literals, and the third is a verbatim string literal. You begin by comparing s1 to s2. The Compare( ) method is a public static method of string, and it is overloaded. The first overloaded version takes two strings and compares them:

// compare two strings, case sensitive
result = string.Compare(s1, s2);
Console.WriteLine("compare s1: {0}, s2: {1}, result: {2}\n",
   s1, s2, result);

This is a case-sensitive comparison and returns different values, depending on the results of the comparison:

  • A negative integer, if the first string is less than the second string

  • 0, if the strings are equal

  • A positive integer, if the first string is greater than the second string

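In this case, the output properly indicates that s1 is “less than” s2. Compare( ) performs a culture-sensitive comparison by default, and in the sort order used for English a lowercase letter comes before the corresponding uppercase letter (even though its Unicode code point, as in ASCII, is the larger of the two):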

compare s1: abcd, s2: ABCD, result: -1

The second comparison uses an overloaded version of Compare( ) that takes a third, Boolean parameter, whose value determines whether case should be ignored in the comparison. If the value of this “ignore case” parameter is true, the comparison is made without regard to case, as in the following:

result = string.Compare(s1,s2, true);
Console.WriteLine("compare insensitive");
Console.WriteLine("s4: {0}, s2: {1}, result: {2}\n", s1, s2, result);

Tip

The result is written with two WriteLine( ) statements to keep the lines short enough to print properly in this book.

This time, the case is ignored and the result is 0, indicating that the two strings are identical (without regard to case):

compare insensitive

s4: abcd, s2: ABCD, result: 0
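Compare( ) also has overloads that take a StringComparison value, which lets you state explicitly how the comparison should be made. The sketch below uses ordinal comparison, which compares the numeric values of the characters rather than applying culture-specific sorting rules. ComparisonDemo and OrdinalSign are names invented for this illustration, and only the sign of the ordinal result is shown because the exact magnitude is not something to rely on:

```csharp
using System;

public static class ComparisonDemo
{
    // Returns -1, 0, or 1 according to an ordinal (character-value) comparison.
    public static int OrdinalSign(string a, string b)
    {
        return Math.Sign(string.Compare(a, b, StringComparison.Ordinal));
    }

    public static void Main()
    {
        // Ordinal: 'a' (97) is greater than 'A' (65), so "abcd" comes after "ABCD".
        Console.WriteLine(OrdinalSign("abcd", "ABCD")); // 1

        // Ordinal, ignoring case: the two strings are considered equal.
        Console.WriteLine(
            string.Compare("abcd", "ABCD", StringComparison.OrdinalIgnoreCase)); // 0
    }
}
```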

Example 10-1 then concatenates some strings. There are a couple of ways to accomplish this. You can use the Concat( ) method, which is a static public method of string:

string s6 = string.Concat(s1,s2);

or, you can simply use the overloaded concatenation (+) operator:

string s7 = s1 + s2;

In both cases, the output reflects that the concatenation was successful:

s6 concatenated from s1 and s2: abcdABCD
s7 concatenated from s1 + s2: abcdABCD

Similarly, you can create a new copy of a string in two ways. First, you can use the static Copy( ) method:

string s8 = string.Copy(s7);

This actually creates two separate strings with the same values. Because strings are immutable, this is wasteful. Better is to use either simple assignment or the Clone method (mentioned earlier), both of which leave you with two variables pointing to the same string in memory:

string s9 = s8;
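The difference between a true copy and a second reference can be observed with Object.ReferenceEquals( ). In the sketch below, new string(char[]) stands in for string.Copy( ), which newer versions of .NET flag as obsolete; that substitution, and the names CopyDemo and FreshCopy, are assumptions made for this illustration:

```csharp
using System;

public static class CopyDemo
{
    // Creates a distinct string object holding the same characters as the source.
    // (A stand-in for string.Copy; FreshCopy is a hypothetical helper.)
    public static string FreshCopy(string source)
    {
        return new string(source.ToCharArray());
    }

    public static void Main()
    {
        string s7 = string.Concat("abcd", "ABCD");
        string copy = FreshCopy(s7); // a second object with the same value
        string alias = s7;           // a second reference to the same object

        Console.WriteLine(Object.ReferenceEquals(s7, copy));  // False
        Console.WriteLine(Object.ReferenceEquals(s7, alias)); // True
        Console.WriteLine(copy == s7);                        // True: same value
    }
}
```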

The .NET String class provides three ways to test for the equality of two strings. First, you can use the overloaded Equals( ) method and ask s9 directly whether s8 is of equal value:

Console.WriteLine("\nDoes s9.Equals(s8)?: {0}", s9.Equals(s8));

A second technique is to pass both strings to String’s static method, Equals( ):

Console.WriteLine("Does Equals(s9,s8)?: {0}",
 string.Equals(s9,s8));

A final method is to use the equality operator (==) of String:

Console.WriteLine("Does s9==s8?: {0}", s9 == s8);

In each case, the returned result is a Boolean value, as shown in the output:

Does s9.Equals(s8)?: True
Does Equals(s9,s8)?: True
Does s9==s8?: True

The next several lines in Example 10-1 use the index operator ([]) to find a particular character within a string, and use the Length property to return the length of the entire string:

Console.WriteLine("\nString s9 is {0} characters long.", s9.Length);
Console.WriteLine("The 5th character is {0}\n", s9[4]);

Here’s the output:

String s9 is 8 characters long.
The 5th character is A

The EndsWith( ) method asks a string whether a substring is found at the end of the string. Thus, you might first ask s3 whether it ends with Training (which it doesn’t), and then whether it ends with Consulting (which it does):

// test whether a string ends with a set of characters
Console.WriteLine("s3:{0}\nEnds with Training?: {1}\n",
    s3, s3.EndsWith("Training") );
Console.WriteLine("Ends with Consulting?: {0}",
    s3.EndsWith("Consulting"));

The output reflects that the first test fails and the second succeeds:

s3:Liberty Associates, Inc.
               provides custom .NET development,
               on-site Training and Consulting
Ends with Training?: False
Ends with Consulting?: True

The IndexOf( ) method locates a substring within our string, and the Insert( ) method inserts a new substring into a copy of the original string.

The following code locates the first occurrence of Training in s3:

Console.WriteLine("\nThe first occurrence of Training ");
Console.WriteLine ("in s3 is {0}\n", s3.IndexOf("Training"));

The output indicates that the offset is 101:

The first occurrence of Training
in s3 is 101

You can then use that value to insert the word excellent, followed by a space, into that string. Actually, the insertion is into a copy of the string returned by the Insert( ) method and assigned to s10:

string s10 = s3.Insert(101,"excellent ");
Console.WriteLine("s10: {0}\n",s10);

Here’s the output:

s10: Liberty Associates, Inc.
               provides custom .NET development,
               on-site excellent Training and Consulting

Finally, you can combine these operations:

string s11 = s3.Insert(s3.IndexOf("Training"),"excellent ");
Console.WriteLine("s11: {0}\n",s11);

to obtain the identical output:

s11: Liberty Associates, Inc.
               provides custom .NET development,
               on-site excellent Training and Consulting
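One caveat when combining IndexOf( ) and Insert( ): IndexOf( ) returns -1 when the substring is not found, and passing a negative index to Insert( ) throws an ArgumentOutOfRangeException. A defensive version might look like the following sketch, in which InsertDemo and InsertBefore are hypothetical names and the ordinal comparison is a choice made for the example:

```csharp
using System;

public static class InsertDemo
{
    // Inserts 'addition' immediately before the first occurrence of 'marker'.
    // If the marker is absent, the original text is returned unchanged.
    public static string InsertBefore(string text, string marker, string addition)
    {
        int index = text.IndexOf(marker, StringComparison.Ordinal);
        if (index < 0)
        {
            return text; // nothing to anchor the insertion to
        }
        return text.Insert(index, addition);
    }

    public static void Main()
    {
        string s = "on-site Training and Consulting";

        // prints: on-site excellent Training and Consulting
        Console.WriteLine(InsertBefore(s, "Training", "excellent "));

        // "Seminars" does not occur, so the text is printed unchanged.
        Console.WriteLine(InsertBefore(s, "Seminars", "excellent "));
    }
}
```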

Finding Substrings

The String type provides an overloaded Substring( ) method for extracting substrings from within strings. Both versions take an index indicating where to begin the extraction, and one of the two versions takes a second argument giving the number of characters to extract. Example 10-2 illustrates the Substring( ) method.

Example 10-2. Using the Substring( ) method
using System;
using System.Collections.Generic;
using System.Text;

namespace SubString
{
    public class StringTester
    {
        static void Main(  )
        {
            // create some strings to work with
            string s1 = "One Two Three Four";

            int ix;

            // get the index of the last space
            ix = s1.LastIndexOf(" ");

            // get the last word.
            string s2 = s1.Substring(ix + 1);

            // set s1 to the substring starting at 0
            // and ending at ix (the start of the last word)
            // thus s1 has one two three
            s1 = s1.Substring(0, ix);

            // find the last space in s1 (after two)
            ix = s1.LastIndexOf(" ");

            // set s3 to the substring starting at
            // ix, the space after "two" plus one more
            // thus s3 = "three"
            string s3 = s1.Substring(ix + 1);

            // reset s1 to the substring starting at 0
            // and ending at ix, thus the string "one two"
            s1 = s1.Substring(0, ix);

            // reset ix to the space between
            // "one" and "two"
            ix = s1.LastIndexOf(" ");

            // set s4 to the substring starting one
            // space after ix, thus the substring "two"
            string s4 = s1.Substring(ix + 1);

            // reset s1 to the substring starting at 0
            // and ending at ix, thus "one"
            s1 = s1.Substring(0, ix);

            // set ix to the last space, but there is
            // none, so ix now = -1
            ix = s1.LastIndexOf(" ");

            // set s5 to the substring at one past
            // the last space. There was no last space
            // so this sets s5 to the substring starting
            // at zero
            string s5 = s1.Substring(ix + 1);

            Console.WriteLine("s2: {0}\ns3: {1}", s2, s3);
            Console.WriteLine("s4: {0}\ns5: {1}\n", s4, s5);
            Console.WriteLine("s1: {0}\n", s1);
        }
    }
}


Output:
s2: Four
s3: Three
s4: Two
s5: One

s1: One

Example 10-2 is not an elegant solution to the problem of extracting words from a string, but it is a good first approximation, and it illustrates a useful technique. The example begins by creating a string, s1:

string s1 = "One Two Three Four";

Next, ix is assigned the index of the last space in the string:

ix=s1.LastIndexOf(" ");

Then, the substring that begins one space later is assigned to the new string, s2:

string s2 = s1.Substring(ix+1);

This extracts ix+1 to the end of the line, assigning to s2 the value Four. The next step is to remove the word Four from s1. You can do this by assigning to s1 the substring of s1 that begins at 0 and is ix characters long, so that it ends just before the space at index ix:

s1 = s1.Substring(0,ix);

Reassign ix to the last (remaining) space, which points you to the beginning of the word Three, which we then extract into string s3. Continue like this until s4 and s5 are populated. Finally, print the results:

s2: Four
s3: Three
s4: Two
s5: One

s1: One

This isn’t elegant, but it works, and it illustrates the use of Substring. This is not unlike using pointer arithmetic in C++, but without the pointers and unsafe code.
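Because the second argument to Substring( ) is a length rather than an ending index, the calls in Example 10-2 that start at 0 happen to read the same either way. The sketch below makes the distinction visible with a nonzero starting index. SubstringDemo and Between are hypothetical names, with Between written only to show how a start and end pair translates into a start and length pair:

```csharp
using System;

public static class SubstringDemo
{
    // Returns the characters from 'start' up to, but not including, 'endExclusive'.
    public static string Between(string text, int start, int endExclusive)
    {
        return text.Substring(start, endExclusive - start);
    }

    public static void Main()
    {
        string s = "One Two Three Four";

        Console.WriteLine(s.Substring(4, 3));  // Two (start at index 4, take 3 chars)
        Console.WriteLine(s.Substring(4));     // Two Three Four
        Console.WriteLine(Between(s, 8, 13));  // Three
    }
}
```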

Splitting Strings

A more effective solution to the problem illustrated in Example 10-2 is to use the Split( ) method of String, whose job is to parse a string into substrings. To use Split( ), pass in an array of delimiters (characters that will indicate a split in the words), and the method returns an array of substrings. Example 10-3 illustrates.

Example 10-3. Using the Split( ) method
using System;
using System.Collections.Generic;
using System.Text;

namespace StringSplit
{
    public class StringTester
    {
        static void Main(  )
        {
            // create some strings to work with
            string s1 = "One,Two,Three Liberty Associates, Inc.";

            // constants for the space and comma characters
            const char Space = ' ';
            const char Comma = ',';

            // array of delimiters to split the sentence with
            char[] delimiters = new char[] {Space, Comma};

            string output = "";
            int ctr = 1;

            // split the string and then iterate over the
            // resulting array of strings
            foreach (string subString in s1.Split(delimiters))
            {
                output += ctr++;
                output += ": ";
                output += subString;
                output += "\n";
            }
            Console.WriteLine(output);
        }
    }
}
Output:
1: One
2: Two
3: Three
4: Liberty
5: Associates
6:
7: Inc.

You start by creating a string to parse:

string s1 = "One,Two,Three Liberty Associates, Inc.";

The delimiters are set to the space and comma characters. You then call Split( ) on this string, and pass the results to the foreach loop:

foreach (string subString in s1.Split(delimiters))

Tip

Because Split uses the params keyword, you can reduce your code to:

foreach (string subString in s1.Split(' ', ','))

This eliminates the declaration of the array entirely.
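The empty sixth entry in the output of Example 10-3 arises because the comma and the space after Associates are treated as two separate delimiters with nothing between them. If the empty entries are not wanted, one option is the Split( ) overload that accepts StringSplitOptions.RemoveEmptyEntries, sketched here with the hypothetical names SplitOptionsDemo and Words:

```csharp
using System;

public static class SplitOptionsDemo
{
    // Splits on spaces and commas, discarding any empty substrings.
    public static string[] Words(string text)
    {
        char[] delimiters = { ' ', ',' };
        return text.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string s1 = "One,Two,Three Liberty Associates, Inc.";
        int ctr = 1;

        // Prints six numbered lines, One through Inc., with no empty entry.
        foreach (string word in Words(s1))
        {
            Console.WriteLine("{0}: {1}", ctr++, word);
        }
    }
}
```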

Start by initializing output to an empty string, and then build up the output string in four steps. Concatenate the value of ctr. Next add the colon, then the substring returned by Split( ), then the newline. With each concatenation, a new copy of the string is made, and all four steps are repeated for each substring found by Split( ). This repeated copying of strings is terribly inefficient.

The problem is that the string type is not designed for this kind of operation. What you want is to create a new string by appending a formatted string each time through the loop. The class you need is StringBuilder.

Manipulating Dynamic Strings

The System.Text.StringBuilder class is used for creating and modifying strings. Table 10-2 summarizes the important members of StringBuilder.

Table 10-2. StringBuilder methods

Method or property   Explanation
Chars                The indexer
Length               Retrieves or assigns the length of the StringBuilder
Append( )            Overloaded public method that appends a string of characters to the end of the current StringBuilder
AppendFormat( )      Overloaded public method that replaces format specifiers with the formatted value of an object
Insert( )            Overloaded public method that inserts a string of characters at the specified position
Remove( )            Removes the specified characters
Replace( )           Overloaded public method that replaces all instances of specified characters with new characters

Unlike String, StringBuilder is mutable. Example 10-4 replaces the String object in Example 10-3 with a StringBuilder object.

Example 10-4. Using a StringBuilder
using System;
using System.Collections.Generic;
using System.Text;

namespace UsingStringBuilder
{
    public class StringTester
    {
        static void Main(  )
        {
            // create some strings to work with
            string s1 = "One,Two,Three Liberty Associates, Inc.";

            // constants for the space and comma characters
            const char Space = ' ';
            const char Comma = ',';

            // array of delimiters to split the sentence with
            char[] delimiters = new char[] {Space, Comma};

            // use a StringBuilder class to build the
            // output string
            StringBuilder output = new StringBuilder(  );
            int ctr = 1;

            // split the string and then iterate over the
            // resulting array of strings
            foreach (string subString in s1.Split(delimiters))
            {
                // AppendFormat appends a formatted string
                output.AppendFormat("{0}: {1}\n", ctr++, subString);
            }
            Console.WriteLine(output);
        }
    }
}

Only the last part of the program is modified. Instead of using the concatenation operator to modify the string, use the AppendFormat( ) method of StringBuilder to append new, formatted strings as you create them. This is more efficient. The output is identical to that of Example 10-3:

1: One
2: Two
3: Three
4: Liberty
5: Associates
6:
7: Inc.
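Example 10-4 uses only AppendFormat( ). The other members listed in Table 10-2 modify the same buffer in place, as the following sketch suggests. BuilderDemo and Edit are names made up for this illustration, and the sequence of edits is arbitrary:

```csharp
using System;
using System.Text;

public static class BuilderDemo
{
    // Applies a series of in-place edits and returns the final text.
    public static string Edit(string name)
    {
        StringBuilder sb = new StringBuilder(name);
        sb.Append(", Inc.");                // Liberty Associates, Inc.
        sb.Insert(0, "* ");                 // * Liberty Associates, Inc.
        sb.Replace("Associates", "Assoc."); // * Liberty Assoc., Inc.
        sb.Remove(0, 2);                    // Liberty Assoc., Inc.
        return sb.ToString();
    }

    public static void Main()
    {
        string edited = Edit("Liberty Associates");
        Console.WriteLine(edited);          // Liberty Assoc., Inc.

        // Assigning a smaller Length truncates the contents.
        StringBuilder truncated = new StringBuilder(edited);
        truncated.Length = 7;
        Console.WriteLine(truncated);       // Liberty
    }
}
```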

Regular Expressions

Regular expressions are a powerful language for describing and manipulating text. A regular expression is applied to a string—that is, to a set of characters. Often, that string is an entire text document.

A regular expression is applied to a string for one of the following purposes:

  • To find out whether the string matches the regular expression

  • To return a substring

  • To return a new string representing a modification of some part of the original string

(Remember that strings are immutable, and so can’t be changed by the regular expression.)

By applying a properly constructed regular expression to the following string:

One,Two,Three Liberty Associates, Inc.

you can return any or all of its substrings (e.g., Liberty or One), or modified versions of its substrings (e.g., LIBeRtY or OnE). What the regular expression does is determined by the syntax of the regular expression itself.

A regular expression consists of two types of characters: literals and metacharacters. A literal is a character you wish to match in the target string. A metacharacter is a special symbol that acts as a command to the regular expression parser. The parser is the engine responsible for understanding the regular expression. For example, if you create a regular expression:

^(From|To|Subject|Date):

this will match any substring with the letters From, To, Subject, or Date, as long as those letters start a new line (^) and end with a colon (:).

The caret (^) in this case indicates to the regular expression parser that the string you’re searching for must begin a new line. The letters in From and To are literals, and the left and right parentheses (( )) and vertical bar (|) metacharacters are used to group sets of literals and indicate that any of the choices should match. (Note that ^ is a metacharacter as well, used to indicate the start of the line.)

Thus, you would read this line:

^(From|To|Subject|Date):

as follows: “Match any string that begins a new line followed by any of the four literal strings From, To, Subject, or Date followed by a colon.”
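To try this pattern with the Regex class, note that in .NET the caret matches only at the very start of the input unless RegexOptions.Multiline is specified. The sketch below assumes that option so that each header line can match. The message text and the names HeaderDemo and CountHeaders are invented for the illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeaderDemo
{
    // Counts the lines that begin with From:, To:, Subject:, or Date:.
    public static int CountHeaders(string message)
    {
        // Multiline makes ^ match at the start of every line, not just the input.
        Regex header = new Regex("^(From|To|Subject|Date):", RegexOptions.Multiline);
        return header.Matches(message).Count;
    }

    public static void Main()
    {
        // A made-up message used only for this sketch.
        string message =
            "From: alice\n" +
            "To: bob\n" +
            "Subject: Regex\n" +
            "\n" +
            "The word From: in mid-line is not matched.";

        Console.WriteLine(CountHeaders(message)); // 3
    }
}
```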

Tip

A full explanation of regular expressions is beyond the scope of this book, but all the regular expressions used in the examples are explained. For a complete understanding of regular expressions, I highly recommend Mastering Regular Expressions by Jeffrey E. F. Friedl (O’Reilly).

Using Regular Expressions: Regex

The .NET Framework provides an object-oriented approach to regular expression matching and replacement.

Tip

C#’s regular expressions are based on Perl 5 regexp, including lazy quantifiers (??, *?, +?, {n,m}?), positive and negative look ahead, and conditional evaluation.
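The effect of a lazy quantifier is easiest to see next to its greedy counterpart. In the following sketch, the sample text and the names LazyDemo and FirstTag are assumptions made for the example; the greedy .* runs to the last closing tag, whereas the lazy .*? stops at the first:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyDemo
{
    // Returns the first match of a bold-tag pattern, greedy or lazy.
    public static string FirstTag(string html, bool lazy)
    {
        string pattern = lazy ? "<b>.*?</b>" : "<b>.*</b>";
        return Regex.Match(html, pattern).Value;
    }

    public static void Main()
    {
        string html = "<b>one</b> and <b>two</b>";

        Console.WriteLine(FirstTag(html, false)); // <b>one</b> and <b>two</b>
        Console.WriteLine(FirstTag(html, true));  // <b>one</b>
    }
}
```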

The namespace System.Text.RegularExpressions is the home to all the .NET Framework objects associated with regular expressions. The central class for regular expression support is Regex, which represents an immutable, compiled regular expression. Although instances of Regex can be created, the class also provides a number of useful static methods. Example 10-5 illustrates the use of Regex.

Example 10-5. Using the Regex class for regular expressions
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

namespace UsingRegEx
{
    public class Tester
    {
        static void Main(  )
        {
            string s1 = "One,Two,Three Liberty Associates, Inc.";
            Regex theRegex = new Regex(" |, |,");
            StringBuilder sBuilder = new StringBuilder(  );
            int id = 1;

            foreach (string subString in theRegex.Split(s1))
            {
                sBuilder.AppendFormat("{0}: {1}\n", id++, subString);
            }
            Console.WriteLine("{0}", sBuilder);
        }
    }
}

Output:
1: One
2: Two
3: Three
4: Liberty
5: Associates
6: Inc.

Example 10-5 begins by creating a string, s1, which is identical to the string used in Example 10-4:

string s1 = "One,Two,Three Liberty Associates, Inc.";

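It also creates a regular expression, which will be used to search that string, matching any space, comma followed by a space, or comma. The order of the alternatives matters: because the comma-plus-space alternative is tried before the lone comma, the sequence “, ” is consumed as a single delimiter, which is why the output contains no empty entry:

Regex theRegex = new Regex(" |, |,");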

One of the overloaded constructors for Regex takes a regular expression string as its parameter. This is a bit confusing. In the context of a C# program, which is the regular expression? Is it the text passed in to the constructor, or the Regex object itself? It is true that the text string passed to the constructor is a regular expression in the traditional sense of the term. From an object-oriented C# point of view, however, the argument to the constructor is just a string of characters; it is theRegex that is the regular expression object.

The rest of the program proceeds like the earlier Example 10-4, except that instead of calling Split( ) on string s1, the Split( ) method of Regex is called. Regex.Split( ) acts in much the same way as String.Split( ), returning an array of strings as a result of matching the regular expression pattern within theRegex.

Regex.Split( ) is overloaded. The simplest version is called on an instance of Regex, as shown in Example 10-5. There is also a static version of this method, which takes a string to search and the pattern to search with, as illustrated in Example 10-6.

Example 10-6. Using static Regex.Split( )
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

namespace RegExSplit
{
    public class Tester
    {
        static void Main(  )
        {
            string s1 = "One,Two,Three Liberty Associates, Inc.";
            StringBuilder sBuilder = new StringBuilder(  );
            int id = 1;
            foreach (string subStr in Regex.Split(s1, " |, |,"))
            {
                sBuilder.AppendFormat("{0}: {1}\n", id++, subStr);
            }
            Console.WriteLine("{0}", sBuilder);
        }
    }
}

Example 10-6 is identical to Example 10-5, except that it doesn’t instantiate an object of type Regex. Instead, Example 10-6 uses the static version of Split( ), which takes two arguments: a string to search, and a regular expression string that represents the pattern to match.

The instance Split( ) method is also overloaded, with versions that limit the number of substrings returned and that set the position within the target string where the search will begin.
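
As a rough sketch of those overloads, the code below calls Split(string, int) and Split(string, int, int) on a Regex instance. The SplitLimits class and its two wrapper methods are hypothetical names used only for this illustration; the pattern and input string are the ones from Example 10-5.

```csharp
using System;
using System.Text.RegularExpressions;

namespace RegExSplitOverloads
{
    public static class SplitLimits
    {
        private static readonly Regex Delimiters = new Regex(" |, |,");

        // Returns at most maxParts substrings; the last one holds
        // whatever remains of the input, unsplit.
        public static string[] SplitLimited(string input, int maxParts)
        {
            return Delimiters.Split(input, maxParts);
        }

        // As above, but delimiters are searched for only from startAt
        // onward; text before startAt stays in the first substring.
        public static string[] SplitFrom(string input, int maxParts,
                                         int startAt)
        {
            return Delimiters.Split(input, maxParts, startAt);
        }

        public static void Main()
        {
            string s1 = "One,Two,Three Liberty Associates, Inc.";

            foreach (string part in SplitLimited(s1, 3))
            {
                Console.WriteLine("limited: [{0}]", part);
            }

            foreach (string part in SplitFrom(s1, 2, 4))
            {
                Console.WriteLine("from 4:  [{0}]", part);
            }
        }
    }
}
```

In the second call the search starts at index 4, so the comma at index 3 is not treated as a delimiter and "One,Two" comes back as a single substring.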

Using Regex Match Collections

Two additional classes in the .NET RegularExpressions namespace allow you to search a string repeatedly, and to return the results in a collection. The collection returned is of type MatchCollection, which consists of zero or more Match objects. Two important properties of a Match object are its length and its value, each of which can be read as illustrated in Example 10-7.

Example 10-7. Using MatchCollection and Match
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

namespace UsingMatchCollection
{
    class Test
    {
        public static void Main(  )
        {
            string string1 = "This is a test string";

            // find any nonwhitespace followed by whitespace
            Regex theReg = new Regex(@"(\S+)\s");

            // get the collection of matches
            MatchCollection theMatches = theReg.Matches(string1);

            // iterate through the collection
            foreach (Match theMatch in theMatches)
            {
                Console.WriteLine("theMatch.Length: {0}",
                                  theMatch.Length);

                if (theMatch.Length != 0)
                {
                    Console.WriteLine("theMatch: {0}",
                                  theMatch.ToString(  ));
                }
            }
        }
    }
}

Output:
theMatch.Length: 5
theMatch: This
theMatch.Length: 3
theMatch: is
theMatch.Length: 2
theMatch: a
theMatch.Length: 5
theMatch: test

Example 10-7 creates a simple string to search:

string string1 = "This is a test string";

and a trivial regular expression to search it:

Regex theReg = new Regex(@"(\S+)\s");

The string \S finds nonwhitespace, and the plus sign indicates one or more. The string \s (note lowercase) indicates whitespace. Thus, together, this string looks for any nonwhitespace characters followed by whitespace.

Tip

Remember that the at (@) symbol before the string creates a verbatim string, which avoids having to escape the backslash (\) character.

The output shows that the first four words were found. The final word wasn’t found because it isn’t followed by a space. If you insert a space after the word string, and before the closing quotation marks, this program finds that word as well.

The Length property is the length of the captured substring; it is discussed in the section "Using CaptureCollection" later in this chapter.
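
Because the pattern includes the trailing whitespace character, each match is one character longer than the word it contains. The sketch below, which is an illustration rather than one of the book's examples, prints the Index, Length, and Value of each match alongside the text of capture group 1, which holds the word alone. The WordFinder class and its WordsBeforeWhitespace method are invented names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace MatchProperties
{
    public static class WordFinder
    {
        // Hypothetical helper: returns the text of group 1 for every
        // match of (\S+)\s, that is, each word without the whitespace
        // character that follows it.
        public static List<string> WordsBeforeWhitespace(string input)
        {
            List<string> words = new List<string>();
            foreach (Match m in Regex.Matches(input, @"(\S+)\s"))
            {
                words.Add(m.Groups[1].Value);
            }
            return words;
        }

        public static void Main()
        {
            string string1 = "This is a test string";

            foreach (Match m in Regex.Matches(string1, @"(\S+)\s"))
            {
                // Value includes the trailing space; Groups[1] does not.
                Console.WriteLine(
                    "Index {0}, Length {1}, Value '{2}', Group 1 '{3}'",
                    m.Index, m.Length, m.Value, m.Groups[1].Value);
            }
        }
    }
}
```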

Using Regex Groups

It is often convenient to group subexpression matches together so that you can parse out pieces of the matching string. For example, you might want to match on IP addresses and group all IP addresses found anywhere within the string.

Tip

IP addresses are used to locate computers on a network, and typically have the form x.x.x.x, where x is generally any digit between 0 and 255 (such as 192.168.0.1).

The Group class allows you to create groups of matches based on regular expression syntax, and represents the results from a single grouping expression.

A grouping expression names a group and provides a regular expression; any substring matching the regular expression will be added to the group. For example, to create an ip group, you might write:

@"(?<ip>(\d|\.)+)\s"

The Match class derives from Group, and has a collection called Groups that contains all the groups your Match finds.

Example 10-8 illustrates the creation and use of the Groups collection and Group classes.

Example 10-8. Using the Group class
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

namespace RegExGroup
{
    class Test
    {
        public static void Main(  )
        {
            string string1 = "04:03:27 127.0.0.0 LibertyAssociates.com";

            // group time = one or more digits or colons followed by space
            Regex theReg = new Regex(@"(?<time>(\d|:)+)\s" +
                // ip address = one or more digits or dots followed by space
                @"(?<ip>(\d|\.)+)\s" +
                // site = one or more characters
                @"(?<site>\S+)");

            // get the collection of matches
            MatchCollection theMatches = theReg.Matches(string1);

            // iterate through the collection
            foreach (Match theMatch in theMatches)
            {
                if (theMatch.Length != 0)
                {
                    Console.WriteLine("\ntheMatch: {0}",
                        theMatch.ToString(  ));
                    Console.WriteLine("time: {0}",
                        theMatch.Groups["time"]);
                    Console.WriteLine("ip: {0}",
                        theMatch.Groups["ip"]);
                    Console.WriteLine("site: {0}",
                        theMatch.Groups["site"]);
                }
            }
        }
    }
}

Again, Example 10-8 begins by creating a string to search:

string string1 = "04:03:27 127.0.0.0 LibertyAssociates.com";

This string might be one of many recorded in a web server logfile or produced as the result of a search of the database. In this simple example, there are three columns: one for the time of the log entry, one for an IP address, and one for the site, each separated by spaces. Of course, in an example solving a real-life problem, you might need to do more complex queries and choose to use other delimiters and more complex searches.

In Example 10-8, we want to create a single Regex object to search strings of this type and break them into three groups: time, ip address, and site. The regular expression string is fairly simple, so the example is easy to understand. However, keep in mind that in a real search, you would probably use only a part of the source string rather than the entire source string, as shown here:

// group time = one or more digits or colons
// followed by space
Regex theReg = new Regex(@"(?<time>(\d|:)+)\s" +
// ip address = one or more digits or dots
// followed by space
@"(?<ip>(\d|\.)+)\s" +
// site = one or more characters
@"(?<site>\S+)");

Let’s focus on the characters that create the group:

(?<time>(\d|:)+)

The parentheses create a group. Everything between the opening parenthesis (just before the question mark) and the closing parenthesis (in this case, after the + sign) is a single group, which would be unnamed without the ?<time> prefix.

The string ?<time> names that group time, and the group is associated with the matching text, which here is whatever the regular expression (\d|:)+ matches. Together with the \s that follows the group, this can be interpreted as “one or more digits or colons followed by a space.”
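
To see how the named group sits alongside the other groups in a match, the following sketch lists every group that the time pattern defines. It is only an illustration: the GroupNames class and its TimeRegex field are names made up here, while GetGroupNames( ) and GroupNumberFromName( ) are methods of the Regex class.

```csharp
using System;
using System.Text.RegularExpressions;

namespace GroupNumbering
{
    public static class GroupNames
    {
        // The time portion of the pattern from Example 10-8.
        public static readonly Regex TimeRegex =
            new Regex(@"(?<time>(\d|:)+)\s");

        public static void Main()
        {
            Match m = TimeRegex.Match("04:03:27 ");

            // Group 0 is always the entire match, trailing space included.
            Console.WriteLine("Groups[0]: '{0}'", m.Groups[0].Value);

            // The named group holds only the digits and colons.
            Console.WriteLine("Groups[\"time\"]: '{0}'",
                m.Groups["time"].Value);

            // The inner (\d|:) parentheses form an unnamed group, which
            // is identified by number rather than by name.
            foreach (string name in TimeRegex.GetGroupNames())
            {
                Console.WriteLine("group '{0}' has number {1}",
                    name, TimeRegex.GroupNumberFromName(name));
            }
        }
    }
}
```

In .NET, unnamed groups are numbered before named ones, so the inner group is number 1 and time is number 2; using the name spares you from tracking those numbers.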

Similarly, the string ?<ip> names the ip group, and ?<site> names the site group. As Example 10-7 does, Example 10-8 asks for a collection of all the matches:

MatchCollection theMatches = theReg.Matches(string1);

Example 10-8 iterates through the Matches collection, finding each Match object.

If the Length of the Match is greater than 0, a Match was found; it prints the entire match:

Console.WriteLine("\ntheMatch: {0}",
    theMatch.ToString(  ));

Here’s the output:

theMatch: 04:03:27 127.0.0.0 LibertyAssociates.com

It then gets the time group from the theMatch.Groups collection and prints that value:

Console.WriteLine("time: {0}",
    theMatch.Groups["time"]);

This produces the output:

time: 04:03:27

The code then obtains ip and site groups:

Console.WriteLine("ip: {0}",
    theMatch.Groups["ip"]);
Console.WriteLine("site: {0}",
    theMatch.Groups["site"]);

This produces the output:

ip: 127.0.0.0
site: LibertyAssociates.com

In Example 10-8, the Matches collection has only one Match. It is possible, however, to match more than one expression within a string. To see this, modify string1 in Example 10-8 to provide several logfile entries instead of one, as follows:

string string1 = "04:03:27 127.0.0.0 LibertyAssociates.com " +
"04:03:28 127.0.0.0 foo.com " +
"04:03:29 127.0.0.0 bar.com " ;

This creates three matches in the MatchCollection, called theMatches. Here’s the resulting output:

theMatch: 04:03:27 127.0.0.0 LibertyAssociates.com
time: 04:03:27
ip: 127.0.0.0
site: LibertyAssociates.com

theMatch: 04:03:28 127.0.0.0 foo.com
time: 04:03:28
ip: 127.0.0.0
site: foo.com

theMatch: 04:03:29 127.0.0.0 bar.com
time: 04:03:29
ip: 127.0.0.0
site: bar.com

In this example, theMatches contains three Match objects. Each time through the outer foreach loop, we find the next Match in the collection and display its contents:

foreach (Match theMatch in theMatches)

For each Match item found, you can print the entire match, various groups, or both.
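
Instead of printing the groups, you could just as easily collect them for later use. The sketch below gathers each match into a small object; the LogEntry and LogParser types are hypothetical and exist only for this illustration, while the pattern is the one from Example 10-8.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace LogEntries
{
    // Hypothetical record type introduced for this sketch.
    public class LogEntry
    {
        public string Time;
        public string Ip;
        public string Site;
    }

    public static class LogParser
    {
        private static readonly Regex EntryRegex = new Regex(
            @"(?<time>(\d|:)+)\s" +
            @"(?<ip>(\d|\.)+)\s" +
            @"(?<site>\S+)");

        // Builds one LogEntry per match found in the log text.
        public static List<LogEntry> Parse(string log)
        {
            List<LogEntry> entries = new List<LogEntry>();
            foreach (Match m in EntryRegex.Matches(log))
            {
                LogEntry entry = new LogEntry();
                entry.Time = m.Groups["time"].Value;
                entry.Ip = m.Groups["ip"].Value;
                entry.Site = m.Groups["site"].Value;
                entries.Add(entry);
            }
            return entries;
        }

        public static void Main()
        {
            string string1 =
                "04:03:27 127.0.0.0 LibertyAssociates.com " +
                "04:03:28 127.0.0.0 foo.com " +
                "04:03:29 127.0.0.0 bar.com ";

            foreach (LogEntry entry in Parse(string1))
            {
                Console.WriteLine("{0} | {1} | {2}",
                    entry.Time, entry.Ip, entry.Site);
            }
        }
    }
}
```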

Using CaptureCollection

Please note that we are now venturing into advanced use of regular expressions, which themselves are considered a black art by many programmers. Feel free to skip over this section if it gives you a headache, and come back to it if you need it.

Each time a Regex object matches a subexpression, a Capture instance is created and added to a CaptureCollection collection. Each Capture object represents a single capture.

Each group has its own capture collection of the matches for the subexpression associated with the group.

So, taking that apart, if you don’t create Groups, and you match only once, you end up with one CaptureCollection with one Capture object. If you match five times, you end up with one CaptureCollection with five Capture objects in it.

If you don’t create groups, but you match on three subexpressions, you will end up with three CaptureCollections, each of which will have Capture objects for each match for that subexpression.

Finally, if you do create groups (e.g., one group for IP addresses, one group for machine names, one group for dates), and each group has a few capture expressions, you’ll end up with a hierarchy: each group collection will have a number of capture collections (one per subexpression to match), and each group’s capture collection will have a capture object for each match found.
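
One common way for a group to collect more than one capture is for it to sit inside a repeated subexpression. The following sketch, which is an illustration and not one of the book's examples, captures each number in a dotted address separately. The OctetCaptures class, the Octets method, and the octet group name are all invented here, and the pattern makes no attempt to validate the address.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace RepeatedCaptures
{
    public static class OctetCaptures
    {
        // The octet group is inside a repeated subexpression, so it is
        // captured once for each repetition.
        private static readonly Regex OctetRegex =
            new Regex(@"((?<octet>\d+)\.?)+");

        // Hypothetical helper: returns every capture of the octet group
        // from the first match in the input.
        public static List<string> Octets(string address)
        {
            List<string> result = new List<string>();
            Match m = OctetRegex.Match(address);
            foreach (Capture cap in m.Groups["octet"].Captures)
            {
                result.Add(cap.Value);
            }
            return result;
        }

        public static void Main()
        {
            Match m = OctetRegex.Match("192.168.0.1");

            // The group's own value is its most recent capture.
            Console.WriteLine("octet: {0}", m.Groups["octet"].Value);

            // The Captures collection holds every capture, in order.
            foreach (string octet in Octets("192.168.0.1"))
            {
                Console.WriteLine("cap: {0}", octet);
            }
        }
    }
}
```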

A key property of the Capture object is its length, which is the length of the captured substring. When you ask Match for its length, it is Capture.Length that you retrieve because Match derives from Group, which in turn derives from Capture.

Tip

The regular expression inheritance scheme in .NET allows Match to include in its interface the methods and properties of these parent classes. In a sense, a Group is-a Capture: it is a capture that encapsulates the idea of grouping subexpressions. A Match, in turn, is-a Group: it is the encapsulation of all the groups of subexpressions making up the entire match for this regular expression. (See Chapter 5 for more about the is-a relationship and other relationships.)
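
The inheritance chain is easy to confirm for yourself. This short sketch checks the relationships with reflection and reads a match's Length through a Capture reference; the Hierarchy class and its LengthAsCapture method are hypothetical names used only here.

```csharp
using System;
using System.Text.RegularExpressions;

namespace CaptureHierarchy
{
    public static class Hierarchy
    {
        // Reads Length through a Capture reference, which is legal
        // because every Match is-a Group and every Group is-a Capture.
        public static int LengthAsCapture(Match match)
        {
            Capture capture = match;
            return capture.Length;
        }

        public static void Main()
        {
            Match m = Regex.Match("This is a test string", @"(\S+)\s");

            Console.WriteLine("Match derives from Group:   {0}",
                typeof(Match).IsSubclassOf(typeof(Group)));
            Console.WriteLine("Group derives from Capture: {0}",
                typeof(Group).IsSubclassOf(typeof(Capture)));
            Console.WriteLine("Length read as a Capture:   {0}",
                LengthAsCapture(m));
        }
    }
}
```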

Typically, you will find only a single Capture in a CaptureCollection, but that need not be so. Consider what would happen if you were parsing a string in which the company name might occur in either of two positions. To group these together in a single match, create the ?<company> group in two places in your regular expression pattern:

Regex theReg = new Regex(@"(?<time>(\d|:)+)\s" +
@"(?<company>\S+)\s" +
@"(?<ip>(\d|\.)+)\s" +
@"(?<company>\S+)\s");

This regular expression group captures any matching string of characters that follows time, as well as any matching string of characters that follows ip. Given this regular expression, you are ready to parse the following string:

string string1 = "04:03:27 Jesse 0.0.0.127 Liberty ";

The string includes names in both of the positions specified. Here is the result:

theMatch: 04:03:27 Jesse 0.0.0.127 Liberty
time: 04:03:27
ip: 0.0.0.127
Company: Liberty

What happened? Why is the Company group showing Liberty? Where is the first term, which also matched? The answer is that the second term overwrote the first. The group, however, has captured both. Its Captures collection demonstrates this, as illustrated in Example 10-9.

Example 10-9. Examining the Captures collection
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

namespace CaptureCollection
{
    class Test
    {
        public static void Main(  )
        {
            // the string to parse
            // note that names appear in both
            // searchable positions
            string string1 =
            "04:03:27 Jesse 0.0.0.127 Liberty ";

            // regular expression that groups company twice
            Regex theReg = new Regex(@"(?<time>(\d|:)+)\s" +
                            @"(?<company>\S+)\s" +
                            @"(?<ip>(\d|\.)+)\s" +
                            @"(?<company>\S+)\s");

            // get the collection of matches
            MatchCollection theMatches =
            theReg.Matches(string1);

            // iterate through the collection
            foreach (Match theMatch in theMatches)
            {
                if (theMatch.Length != 0)
                {
                    Console.WriteLine("theMatch: {0}",
                        theMatch.ToString(  ));
                    Console.WriteLine("time: {0}",
                        theMatch.Groups["time"]);
                    Console.WriteLine("ip: {0}",
                        theMatch.Groups["ip"]);
                    Console.WriteLine("Company: {0}",
                        theMatch.Groups["company"]);

                    // iterate over the captures collection
                    // in the company group within the
                    // groups collection in the match
                    foreach (Capture cap in
                        theMatch.Groups["company"].Captures)
                    {
                        Console.WriteLine("cap: {0}", cap.ToString(  ));
                    }
                }
            }
        }
    }
}


Output:
theMatch: 04:03:27 Jesse 0.0.0.127 Liberty
time: 04:03:27
ip: 0.0.0.127
Company: Liberty
cap: Jesse
cap: Liberty

The foreach loop near the end of Main( ) iterates through the Captures collection for the company group:

foreach (Capture cap in
    theMatch.Groups["company"].Captures)

Let’s review how this line is parsed. The compiler begins by finding the collection that it will iterate over. theMatch is an object that has a collection named Groups. The Groups collection has an indexer that takes a string and returns a single Group object. Thus, the following line returns a single Group object:

theMatch.Groups["company"]

The Group object has a collection named Captures. Thus, the following line returns a Captures collection for the Group stored at Groups["company"] within the theMatch object:

theMatch.Groups["company"].Captures

The foreach loop iterates over the Captures collection, extracting each element in turn and assigning it to the local variable cap, which is of type Capture. You can see from the output that there are two capture elements: Jesse and Liberty. The second one overwrites the first in the group, and so the displayed value is just Liberty. However, by examining the Captures collection, you can find both values that were captured.



[14] * Ordering the string is one of a number of lexical operations that act on the value of the string and take into account culture-specific information based on the explicitly declared culture or the implicit current culture. Therefore, if the current culture is U.S. English (as is assumed throughout this book), the Compare method considers a less than A. CompareOrdinal performs an ordinal comparison, and thus regardless of culture, a is greater than A.
