CHAPTER 19

Strings

All strings in C# are instances of the System.String type in the Common Language Runtime. Because of this, there are many built-in operations available that work with strings. For example, the String class defines an indexer function that can be used to iterate over the characters of the string.

using System;
class Test
{
    public static void Main()
    {
       string s = "Test String";
       for (int index = 0; index < s.Length; index++)
       {
            Console.WriteLine("Char: {0}", s[index]);
       }
    }
}

Operations

The string class is an example of an immutable type, which means that the characters contained in the string cannot be modified by users of the string. Every string operation that appears to modify a string returns a new, modified string rather than changing the instance on which the method is called. Here’s an example:

string s = "Test String";
s.Replace("Test", "Best");
Console.WriteLine(s);

This takes the string, replaces Test with Best, and then throws away the result. What you want to write is this:

s = s.Replace("Test", "Best");

Immutable types are used to make reference types that have value semantics (in other words, act somewhat like value types).
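
As a minimal sketch of what value semantics means in practice (the class name ValueSemanticsDemo is only illustrative), the following shows that "changing" one string variable never affects another variable that referred to the same string, and that == compares string contents rather than references:

```csharp
using System;
class ValueSemanticsDemo
{
    public static void Main()
    {
        string a = "Test";
        string b = a;           // b refers to the same string as a
        a = a + " String";      // builds a new string; b is unaffected
        Console.WriteLine(a);   // Test String
        Console.WriteLine(b);   // Test

        // == on strings compares the characters, not the references
        string c = "Test String";
        Console.WriteLine(a == c);   // True
    }
}
```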

The String class supports the comparison and searching methods shown in Table 19-1.

Table 19-1. String Comparison and Search Methods

Item Description
Compare() Compares two strings
CompareOrdinal() Compares two string regions using an ordinal comparison
CompareTo() Compares the current instance with another instance
EndsWith() Determines whether a substring exists at the end of a string
StartsWith() Determines whether a substring exists at the beginning of a string
IndexOf() Returns the position of the first occurrence of a substring
IndexOfAny() Returns the position of the first occurrence of any character from a specified set
LastIndexOf() Returns the position of the last occurrence of a substring
LastIndexOfAny() Returns the position of the last occurrence of any character from a specified set
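
A short sketch of several of these methods follows. The class name SearchDemo is illustrative, and the StringComparison arguments are one reasonable choice rather than the only one; they make the comparisons independent of the current culture.

```csharp
using System;
class SearchDemo
{
    public static void Main()
    {
        string s = "Test String";

        // Ordinal comparison, ignoring case: 0 means the strings are equal
        Console.WriteLine(String.Compare("test", "TEST",
                                         StringComparison.OrdinalIgnoreCase));   // 0

        Console.WriteLine(s.StartsWith("Test", StringComparison.Ordinal));   // True
        Console.WriteLine(s.EndsWith("ing", StringComparison.Ordinal));      // True

        // Lowercase 't' occurs at index 3 ("Test") and index 6 ("String")
        Console.WriteLine(s.IndexOf('t'));       // 3
        Console.WriteLine(s.LastIndexOf('t'));   // 6

        // Position of the first or last character that is in the given set
        char[] set = new char[] {'S', 'g'};
        Console.WriteLine(s.IndexOfAny(set));       // 5
        Console.WriteLine(s.LastIndexOfAny(set));   // 10
    }
}
```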

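The String class supports the modification methods in Table 19-2. Because strings are immutable, none of them changes the original string; most return a new string instance, while CopyTo() fills an array and Split() returns one.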

Table 19-2. String Modification Methods

Item Description
Concat() Concatenates two or more strings or objects together. If objects are passed, the ToString() function is called on them.
CopyTo() Copies a specified number of characters from a location in this string into an array.
Insert() Returns a new string with a substring inserted at a specific location.
Join() Joins an array of strings together with a separator between each array element.
Normalize() Normalizes the string into a Unicode form.
PadLeft() Right-aligns a string in a field.
PadRight() Left-aligns a string in a field.
Remove() Deletes characters from a string.
Replace() Replaces all instances of a character or substring with a different character or substring.
Split() Creates an array of strings by splitting a string at any occurrence of one or more characters.
Substring() Extracts a substring from a string.
ToLower() Returns a lowercase version of a string.
ToUpper() Returns an uppercase version of a string.
Trim() Removes whitespace (or a specified set of characters) from both ends of a string.
TrimEnd() Removes whitespace (or a specified set of characters) from the end of a string.
TrimStart() Removes whitespace (or a specified set of characters) from the beginning of a string.
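
The following sketch exercises a few of the methods in Table 19-2 (the class name ModifyDemo is illustrative). The square brackets in the output are there only to make leading and trailing spaces visible; note that the original string is unchanged at the end.

```csharp
using System;
class ModifyDemo
{
    public static void Main()
    {
        string s = "  Test String  ";
        string trimmed = s.Trim();
        Console.WriteLine("[{0}]", trimmed);                    // [Test String]
        Console.WriteLine("[{0}]", trimmed.Substring(5));       // [String]
        Console.WriteLine("[{0}]", trimmed.Insert(4, "ing"));   // [Testing String]
        Console.WriteLine("[{0}]", trimmed.Remove(4));          // [Test]
        Console.WriteLine("[{0}]", "42".PadLeft(5));            // [   42]
        Console.WriteLine("[{0}]", "42".PadRight(5));           // [42   ]
        Console.WriteLine("[{0}]",
            String.Join("-", new string[] {"a", "b", "c"}));    // [a-b-c]
        Console.WriteLine("[{0}]", s);                          // [  Test String  ]
    }
}
```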

String Literals

String literals are described in Chapter 32.

String Encodings and Conversions

C# strings are always Unicode strings. When dealing only in the .NET world, this greatly simplifies working with strings.

Unfortunately, it’s sometimes necessary to deal with the messy details of other kinds of strings, especially when dealing with text files produced by older applications. The System.Text namespace contains classes that can be used to convert between an array of bytes and a character encoding such as ASCII, Unicode, UTF7, or UTF8. Each encoding is encapsulated in a class such as ASCIIEncoding.

To convert from a string to a block of bytes, the GetEncoder() method on the encoding class is called to obtain an Encoder, which is then used to do the encoding. Similarly, to convert from a block of bytes in a specific encoding back to characters, GetDecoder() is called to obtain a Decoder.
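
When the whole string or byte array is available at once, the encoding classes also offer the convenience methods GetBytes() and GetString(), which avoid managing an Encoder or Decoder directly. The sketch below uses those methods rather than GetEncoder() and GetDecoder(); the class name EncodingDemo and the sample text are illustrative.

```csharp
using System;
using System.Text;
class EncodingDemo
{
    public static void Main()
    {
        string s = "caf\u00e9";   // the last character (e with acute accent) is outside ASCII

        byte[] utf8 = Encoding.UTF8.GetBytes(s);
        Console.WriteLine(utf8.Length);    // 5: the accented character takes two bytes

        byte[] utf16 = Encoding.Unicode.GetBytes(s);
        Console.WriteLine(utf16.Length);   // 8: two bytes per character

        // Decoding with the same encoding recovers the original string
        string roundTrip = Encoding.UTF8.GetString(utf8);
        Console.WriteLine(roundTrip == s);   // True

        // ASCII cannot represent the accented character, so it becomes '?'
        byte[] ascii = Encoding.ASCII.GetBytes(s);
        Console.WriteLine(Encoding.ASCII.GetString(ascii));   // caf?
    }
}
```

An Encoder or Decoder is the better fit when the data arrives in chunks, because it keeps track of a character that is split across two buffers.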

Converting Objects to Strings

The function object.ToString() is overridden by the built-in types to provide an easy way of converting from a value to a string representation of that value. Calling ToString() produces the default representation of a value; a different representation may be obtained by calling String.Format(). See the section on formatting in Chapter 39 for more information.
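
As a brief sketch of the difference (the class name ToStringDemo and the sample values are illustrative), ToString() gives the default representation, while a format string passed to String.Format() selects another one. The invariant culture is used for the floating-point value so that the decimal separator does not depend on the machine's settings.

```csharp
using System;
using System.Globalization;
class ToStringDemo
{
    public static void Main()
    {
        int count = 255;
        Console.WriteLine(count.ToString());                   // 255

        // Format specifiers select a different representation
        Console.WriteLine(String.Format("{0:X}", count));      // FF
        Console.WriteLine(String.Format("[{0,6}]", count));    // [   255]
        Console.WriteLine(String.Format(CultureInfo.InvariantCulture,
                                        "{0:F2}", 3.14159));   // 3.14
    }
}
```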

An Example

The Split() function can be used to break a string into substrings at separators.

using System;
class Test
{
    public static void Main()
    {
       string s = "Oh, I hadn't thought of that";
       char[] separators = new char[] {' ', ','};
       foreach (string sub in s.Split(separators))
       {
            Console.WriteLine("Word: {0}", sub);
       }
    }
}

This example produces the following output:

Word: Oh
Word:
Word: I
Word: hadn't
Word: thought
Word: of
Word: that

The separators character array defines what characters the string will be broken on. The Split() function returns an array of strings, and the foreach statement iterates over the array and prints it out.

In this case, the output isn’t particularly useful because the ", " sequence is split twice, once at the comma and once at the space, which produces an empty entry. This can be fixed by using the regular expression classes.
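
For this particular problem there is also a simpler alternative that stays within the String class: an overload of Split() accepts StringSplitOptions.RemoveEmptyEntries, which drops the empty entries from the result. The following is a sketch of that approach (the class name SplitDemo is illustrative), not a replacement for the regular expression technique.

```csharp
using System;
class SplitDemo
{
    public static void Main()
    {
        string s = "Oh, I hadn't thought of that";
        char[] separators = new char[] {' ', ','};
        // RemoveEmptyEntries discards the empty string between ',' and ' '
        foreach (string sub in s.Split(separators,
                                       StringSplitOptions.RemoveEmptyEntries))
        {
            Console.WriteLine("Word: {0}", sub);
        }
    }
}
```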

StringBuilder

Though the String.Format() function can be used to create a string based on the values of other strings, it isn’t necessarily the most efficient way to assemble strings. The runtime provides the StringBuilder class to make this process easier.

The StringBuilder class supports the properties and methods described in Table 19-3 and Table 19-4.

Table 19-3. StringBuilder Properties

Property Description
Capacity Retrieves or sets the number of characters the StringBuilder can hold.
[] The StringBuilder indexer is used to get or set a character at a specific position.
Length Retrieves or sets the length.
MaxCapacity Retrieves the maximum capacity of the StringBuilder.

Table 19-4. StringBuilder Methods

Method Description
Append() Appends the string representation of an object
AppendFormat() Appends a string representation of an object, using a specific format string for the object
EnsureCapacity() Ensures the StringBuilder has enough room for a specific number of characters
Insert() Inserts the string representation of a specified object at a specified position
Remove() Removes the specified characters
Replace() Replaces all instances of a character with a new character

The following example demonstrates how the StringBuilder class can be used to create a string from separate strings:

using System;
using System.Text;
class Test
{
    public static void Main()
    {
       string s = "I will not buy this record, it is scratched";
       char[] separators = new char[] {' ', ','};
       StringBuilder sb = new StringBuilder();
       int number = 1;
       foreach (string sub in s.Split(separators))
       {
            sb.AppendFormat("{0}: {1} ", number++, sub);
       }
       Console.WriteLine("{0}", sb);
    }
}

This code will create a string with numbered words and will produce the following output:

1: I 2: will 3: not 4: buy 5: this 6: record 7:  8: it 9: is 10: scratched

Because the call to Split() specified both the space and the comma as separators, it considers there to be a word between the comma and the following space, which results in an empty entry.
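
Unlike String, a StringBuilder is modified in place. The sketch below (the class name BuilderDemo is illustrative) uses several of the members from Tables 19-3 and 19-4 on a single instance:

```csharp
using System;
using System.Text;
class BuilderDemo
{
    public static void Main()
    {
        StringBuilder sb = new StringBuilder("Test String");
        sb.Replace("Test", "Best");     // contents are now "Best String"
        sb.Insert(0, "The ");           // "The Best String"
        sb.Remove(sb.Length - 7, 7);    // removes " String", leaving "The Best"
        sb[0] = 't';                    // the indexer sets a single character
        sb.Append('!');
        Console.WriteLine(sb.ToString());   // the Best!
        Console.WriteLine(sb.Length);       // 9
    }
}
```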

Regular Expressions

If the searching functions found in the String class aren’t powerful enough, the System.Text.RegularExpressions namespace contains a regular expression class named Regex. Regular expressions provide a very powerful way to perform search and replace operations.

While this section has a few examples of using regular expressions, a detailed description of them is beyond the scope of the book. There is considerable information about regular expressions in the MSDN documentation. Several regular expression books are available, and the subject is also covered in most books about Perl. Mastering Regular Expressions, Third Edition (O’Reilly, 2006) by Jeffrey Friedl and Regular Expression Recipes: A Problem-Solution Approach (Apress, 2004) by Nathan A. Good are two great references.

The regular expression class can use a rather interesting technique to get maximum performance. When the Compiled option is specified, rather than interpret the regular expression for each match, it writes a short program on the fly to implement the regular expression match, and that code is then run (see footnote 1).

The previous example using Split() can be revised to use a regular expression, rather than single characters, to specify how the split should occur. This will remove the blank word that was found in the preceding example.

// file: regex.cs
using System;
using System.Text.RegularExpressions;
class Test
{
    public static void Main()
    {
       string s = "Oh, I hadn't thought of that";
       Regex regex = new Regex(@" |, ");
       foreach (string sub in regex.Split(s))
       {
            Console.WriteLine("Word: {0}", sub);
       }
    }
}

This example produces the following output:

Word: Oh
Word: I
Word: hadn't
Word: thought
Word: of
Word: that

In the regular expression, the string is split either on a space or on a comma followed by a space.
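
The Regex class handles replacement as well as splitting. As a small sketch (the class name ReplaceDemo and the sample string are illustrative), the static Regex.Replace() method can collapse each run of spaces into a single space:

```csharp
using System;
using System.Text.RegularExpressions;
class ReplaceDemo
{
    public static void Main()
    {
        string s = "Oh,   I  hadn't thought   of that";
        // " +" matches a run of one or more spaces
        string collapsed = Regex.Replace(s, " +", " ");
        Console.WriteLine(collapsed);   // Oh, I hadn't thought of that
    }
}
```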

Regular Expression Options

When creating a regular expression, several options can be specified to control how the matches are performed (see Table 19-5). Compiled is especially useful to speed up searches that use the same regular expression multiple times.

Table 19-5. Regular Expression Options

Option Description
Compiled Compiles the regular expression into a custom implementation so that matches are faster
ExplicitCapture Specifies that the only valid captures are named
IgnoreCase Performs case-insensitive matching
IgnorePatternWhitespace Removes unescaped whitespace from the pattern and enables # comments
Multiline Changes the meaning of ^ and $ so they match at the beginning or end of any line, not the beginning or end of the whole string
RightToLeft Performs searches from right to left rather than from left to right
Singleline Single-line mode, where . matches any character including \n
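
Options are passed as a second argument to the Regex constructor and can be combined with the | operator. The following sketch (the class name OptionsDemo and the patterns are illustrative) shows three of the options from Table 19-5:

```csharp
using System;
using System.Text.RegularExpressions;
class OptionsDemo
{
    public static void Main()
    {
        // IgnoreCase: "get" also matches "GET"
        Regex method = new Regex("^get$", RegexOptions.IgnoreCase);
        Console.WriteLine(method.IsMatch("GET"));        // True

        // IgnorePatternWhitespace: the pattern can be laid out with comments
        Regex time = new Regex(@"
            \d\d : \d\d    # hours and minutes
            : \d\d         # seconds",
            RegexOptions.IgnorePatternWhitespace);
        Console.WriteLine(time.IsMatch("00:01:31"));     // True

        // Multiline: ^ matches at the start of every line
        Regex comment = new Regex("^#", RegexOptions.Multiline);
        Console.WriteLine(comment.Matches("#a\nb\n#c").Count);   // 2
    }
}
```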

More Complex Parsing

Using regular expressions to improve the function of Split() doesn’t really demonstrate their power. The following example uses regular expressions to parse an IIS log file. That log file looks something like this:

#Software: Microsoft Internet Information Server 4.0
#Version: 1.0
#Date: 1999-12-31 00:01:22
#Fields: time c-ip cs-method cs-uri-stem sc-status
00:01:31 157.56.214.169 GET /Default.htm 304
00:02:55 157.56.214.169 GET /docs/project/overview.htm 200

The following code will parse this into a more useful form:

// file: logparse.cs
// compile with: csc logparse.cs
using System;
using System.IO;
using System.Text.RegularExpressions;
class Test
{
    public static void Main(string[] args)
    {
       if (args.Length == 0) // we need a file to parse
       {
            Console.WriteLine("No log file specified.");
       }
       else
       {
            ParseLogFile(args[0]);
       }
    }
    public static void ParseLogFile(string filename)
    {
       if (!System.IO.File.Exists(filename))
       {
            Console.WriteLine ("The file specified does not exist.");
       }
       else
       {
            FileStream f = new FileStream(filename, FileMode.Open);
            StreamReader stream = new StreamReader(f);
            string line;
            line = stream.ReadLine();    // header line
            line = stream.ReadLine();    // version line
            line = stream.ReadLine();    // Date line
            Regex regexDate = new Regex(@"\:\s(?<date>[^\s]+)\s");
            Match match = regexDate.Match(line);
            string date = "";
            if (match.Length != 0)
            {
                date = match.Groups["date"].ToString();
            }
            line = stream.ReadLine();    // Fields line
            Regex regexLine =
                new Regex(       // match digit or :
                        @"(?<time>(\d|:)+)\s" +
                            // match digit or .
                        @"(?<ip>(\d|\.)+)\s" +
                            // match any non-whitespace
                        @"(?<method>\S+)\s" +
                            // match any non-whitespace
                        @"(?<uri>\S+)\s" +
                            // match one or more digits
                        @"(?<status>\d+)");
                // read through the remaining lines and
                // print the fields of each line that matches
            while ((line = stream.ReadLine()) != null)
            {
                //Console.WriteLine(line);
                match = regexLine.Match(line);
                if (match.Length != 0)
                {
                   Console.WriteLine("date: {0} {1}", date,
                                     match.Groups["time"]);
                   Console.WriteLine("IP Address: {0}",
                                     match.Groups["ip"]);
                   Console.WriteLine("Method: {0}",
                                     match.Groups["method"]);
                   Console.WriteLine("Status: {0}",
                                     match.Groups["status"]);
                   Console.WriteLine("URI: {0} ",
                                     match.Groups["uri"]);
                }
            }
            f.Close();
       }
    }
}

The general structure of this code should be familiar. There are two regular expressions used in this example. The date string and the regular expression used to match it are as follows:

#Date: 1999-12-31 00:01:22
\:\s(?<date>[^\s]+)\s

In the code, regular expressions are usually written using the verbatim string syntax, since the regular expression syntax also uses the backslash character. Regular expressions are most easily read if they are broken down into separate elements. The following matches the colon (:):

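\:

The backslash (\) escapes the colon so that it is always treated as a literal character. The escape is optional here, because a colon has special meaning only inside grouping constructs such as (?:...), but it is harmless. The following matches a single character of whitespace (such as a tab or space):

\s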

In the following line, the ?<date> names the value that will be matched so it can be extracted later:

(?<date>[^\s]+)

The [^\s] is called a character group, with the ^ character meaning “none of the following characters.” This group therefore matches any nonwhitespace character. Finally, the + character means to match one or more occurrences of the previous description (nonwhitespace). The parentheses delimit the part of the match that is captured under the name date. In the preceding example, this part of the expression matches 1999-12-31.

To match more carefully, the \d (digit) specifier could have been used, with the whole expression written as follows:

\:\s(?<date>\d\d\d\d-\d\d-\d\d)\s
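
As a quick check of the stricter expression, the following sketch (the class name DateDemo and the second test string are illustrative) applies it to the date line from the log file and to a line whose date is not in the expected form:

```csharp
using System;
using System.Text.RegularExpressions;
class DateDemo
{
    public static void Main()
    {
        string line = "#Date: 1999-12-31 00:01:22";
        Regex regexDate = new Regex(@"\:\s(?<date>\d\d\d\d-\d\d-\d\d)\s");
        Match match = regexDate.Match(line);
        if (match.Success)
        {
            Console.WriteLine(match.Groups["date"].Value);   // 1999-12-31
        }

        // A line without a well-formed date does not match at all
        Console.WriteLine(regexDate.IsMatch("#Date: tomorrow 00:01:22"));   // False
    }
}
```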

That covers the simple regular expression. A more complex regular expression is used to match each line of the log file. Because of the regularity of the line, Split() could also have been used, but that wouldn’t have been as illustrative. The clauses of the regular expression are as follows:

(?<time>(\d|:)+)\s      // match digit or : to extract time
(?<ip>(\d|\.)+)\s       // match digit or . to get IP address
(?<method>\S+)\s        // any non-whitespace for method
(?<uri>\S+)\s           // any non-whitespace for uri
(?<status>\d+)          // one or more digits for status
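
To see how the clauses work together, the sketch below (the class name LineDemo is illustrative) applies the combined expression to one of the sample log lines and prints each named group:

```csharp
using System;
using System.Text.RegularExpressions;
class LineDemo
{
    public static void Main()
    {
        Regex regexLine = new Regex(
            @"(?<time>(\d|:)+)\s" +
            @"(?<ip>(\d|\.)+)\s" +
            @"(?<method>\S+)\s" +
            @"(?<uri>\S+)\s" +
            @"(?<status>\d+)");
        string line = "00:02:55 157.56.214.169 GET /docs/project/overview.htm 200";
        Match match = regexLine.Match(line);
        // Print each named group in the order the fields appear
        foreach (string name in new string[] {"time", "ip", "method", "uri", "status"})
        {
            Console.WriteLine("{0}: {1}", name, match.Groups[name].Value);
        }
    }
}
```

This prints time: 00:02:55, ip: 157.56.214.169, method: GET, uri: /docs/project/overview.htm, and status: 200, one per line.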

1 The program is written using the .NET intermediate language—the same one that C# produces as output from a compilation. See Chapter 36 for information on how this works.
