As computer programming has evolved from being primarily concerned with performing complex numeric computations to providing solutions for a broader range of business problems, programming languages have shifted to focus more on string data and the manipulation of such data. String data is simply a logical sequence of individual characters. The System.String
class, which encapsulates the data manipulation, sorting, and searching methods you most commonly perform on strings, enables C# to provide rich support for string data and manipulation.
In C#, string is an alias for System.String, so they are equivalent. Use whichever naming convention you prefer, although the common convention is to use string when referring to the data type and String when accessing static members of the class.
In this hour, you learn to work with strings in C#, including how to manipulate and concatenate strings, extract substrings, and build new strings. After you understand the basics, you learn how to work with regular expressions to perform more complex pattern matching and manipulation.
A string in C# is an immutable sequence of Unicode characters that cannot be modified after creation. Strings are most commonly created by declaring a variable of type string
and assigning to it a quoted string of characters, known as a string literal, as shown here:
string myString = "Now is the time.";
If you have two identical string literals in the same assembly, the runtime only creates one string
object for all instances of that literal within the assembly. This process, called string interning, is used by the C# compiler to eliminate duplicate string literals, saving memory space at runtime and decreasing the time required to perform string comparisons.
String interning can sometimes have unexpected results when comparing string literals using the equality operator:
object obj = "String";
string string1 = "String";
string string2 = typeof(string).Name;
Console.WriteLine(string1 == string2); // true
Console.WriteLine(obj == string1); // true
Console.WriteLine(obj == string2); // false
The first comparison tests for value equality, meaning that it checks whether the two strings have the same content. The second and third comparisons use reference equality because you are comparing an object and a string. If you were to enter this code in a program, you would see two warnings about a “Possible Unintended Reference Comparison” that further tell you to “Cast the Left Hand Side to Type 'string'” to get a value comparison.
Because string interning applies only to literal string values, the value of string2 is not interned because it isn’t a literal. This means that obj and string2 actually refer to different objects in memory, so the reference equality fails.
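The difference between value equality and reference equality can be sketched with a string that is built at runtime. The class and method names here are illustrative only; String.Intern and Object.ReferenceEquals are standard framework methods.

```csharp
using System;

static class InterningSketch
{
    // Builds "String" at runtime so the result is a new object, not the literal.
    public static string BuildAtRuntime()
    {
        return new string(new[] { 'S', 't', 'r', 'i', 'n', 'g' });
    }

    static void Main()
    {
        string literal = "String";
        string built = BuildAtRuntime();

        // Value equality: the contents match.
        Console.WriteLine(literal == built);                        // True

        // Reference equality: built is a separate object in memory.
        Console.WriteLine(Object.ReferenceEquals(literal, built));  // False

        // String.Intern returns the single pooled instance for a given content.
        Console.WriteLine(Object.ReferenceEquals(
            String.Intern(literal), String.Intern(built)));         // True
    }
}
```

Calling String.Intern on a runtime string is rarely necessary in everyday code; it is used here only to make the pooled instance visible.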
String literals can include special escape sequences, which begin with the backslash character (\), to indicate nonprinting characters such as a tab or new line. If you want to include the backslash character as part of the string literal, it must also be escaped. Table 9.1 lists the defined C# character escape sequences.
Another option for creating string literals is the verbatim string literal, which starts with the @ symbol before the opening quote. The benefit of verbatim string literals is that the compiler treats the string exactly as it is written, even if it spans multiple lines or includes escape characters. Only the double-quote character must be escaped, by including two double-quote characters, so that the compiler knows where the string ends.
When the compiler encounters a verbatim string literal, it translates that literal into the properly escaped string literal. Listing 9.1 shows four different strings. The first two declarations are equivalent, although the verbatim string literal is generally easier to read. The last two declarations are also equivalent, where multipleLines2 represents the translated string literal.
string stringLiteral = "C:\\Program Files\\Microsoft Visual Studio 10\\VC#";
string verbatimLiteral = @"C:\Program Files\Microsoft Visual Studio 10\VC#";
string multipleLines = @"This is a ""line"" of text.
And this is the second line.";
string multipleLines2 =
"This is a \"line\" of text.\nAnd this is the second line.";
Strings can also be created by calling the ToString method. Because ToString is declared by System.Object, every object is guaranteed to have it, although the default implementation simply returns the name of the class. All the predefined data types override ToString to provide a meaningful string representation.
An empty string is different from an unassigned string variable (which is null); it is a string containing no characters between the quotes ("").
There is no practical difference between "" and String.Empty, so which one you choose ultimately depends on personal preference, although String.Empty is generally easier to read.
The fastest and simplest way to determine if a string is empty is to test if the Length
property is equal to 0. However, because strings are reference types, it is possible for a string variable to be null
, which would result in a runtime error when you tried to access the Length
property. Because testing to determine if a string is empty is such a common occurrence, C# provides the static method String.IsNullOrEmpty
, shown in Listing 9.2.
public static bool IsNullOrEmpty(string value)
{
if (value != null)
{
return (value.Length == 0);
}
return true;
}
It is also common to consider a string that contains only whitespace characters as an empty string as well. You can use the static String.IsNullOrWhiteSpace
method, shown in Listing 9.3.
public static bool IsNullOrWhiteSpace(string value)
{
if (value != null)
{
for (int i = 0; i < value.Length; i++)
{
if (!char.IsWhiteSpace(value[i]))
{
return false;
}
}
}
return true;
}
Using either String.IsNullOrEmpty
or String.IsNullOrWhiteSpace
helps ensure correctness, readability, and consistency, so they should be used in all situations where you need to determine if a string is null
, empty, or contains only whitespace characters.
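As a brief sketch of how the two methods differ, the following program runs both checks over a null reference, an empty string, a whitespace-only string, and ordinary text. The OrDefault helper is hypothetical and exists only for this illustration.

```csharp
using System;

static class EmptyCheckSketch
{
    // Hypothetical helper: returns a fallback when the input has no usable content.
    public static string OrDefault(string value, string fallback)
    {
        return String.IsNullOrWhiteSpace(value) ? fallback : value;
    }

    static void Main()
    {
        string[] samples = { null, "", "   ", "text" };
        foreach (string sample in samples)
        {
            // Neither method throws when sample is null.
            Console.WriteLine("IsNullOrEmpty: {0,-5} IsNullOrWhiteSpace: {1,-5} Result: {2}",
                String.IsNullOrEmpty(sample),
                String.IsNullOrWhiteSpace(sample),
                OrDefault(sample, "(none)"));
        }
    }
}
```

Only the whitespace-only sample produces different answers from the two methods, which is usually the deciding factor when choosing between them.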
The System.String
class provides a rich set of methods and properties for interacting with and manipulating strings. In fact, System.String
defines more than 40 different public members.
Even though strings are a first-class data type and string data is usually manipulated as a whole, a string is still composed of individual characters. You can use the Length
property to determine the total number of characters in the string. Unlike strings in other languages, such as C and C++, strings in C# do not include a termination character. Because strings are composed of individual characters, it is possible to access specific characters by position as if the string were an array of characters.
A substring is a smaller string contained within the larger original value. Several methods provided by System.String
enable you to find and extract substrings.
To extract a substring, the String
class provides an overloaded Substring
method, which enables you to specify the starting character position and, optionally, the length of the substring to extract. If you don’t provide the length, the resulting substring ends at the end of the original string.
The code in Listing 9.4 creates two substrings. The first substring will start at character position 10 and continue to the end of the original string, resulting in the string “brown fox”. The second substring results in the string “quick”.
string original = "The quick brown fox";
string substring = original.Substring(10);
string substring2 = original.Substring(4, 5);
Extracting substrings in this manner is a flexible approach, especially when combined with other methods enabling you to find the position of specific characters within a string.
The IndexOf and LastIndexOf methods report the index of the first and last occurrence, respectively, of the specified character or string. If you need to find the first or last occurrence of any character in a given set of characters, you can use one of the IndexOfAny or LastIndexOfAny overloads, respectively. If a match is found, the index (or more intuitively, the offset) position of the character or start of the matched string is returned; otherwise, the value –1 is returned. If the string or character you are searching for is empty, the value 0 is returned.
When accessing a string by character position, as the IndexOf, LastIndexOf, IndexOfAny, and LastIndexOfAny methods do, C# starts counting at 0, not 1. This means that the first character of the string is at index position 0. A better way to think about these methods is that they return an offset from the beginning of the string.
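The search methods are often combined with Substring. The following sketch reuses the string from Listing 9.4; the FirstWord and LastWord helpers are hypothetical names introduced for this example.

```csharp
using System;

static class SearchSketch
{
    // Hypothetical helper: returns the text before the first space, or the whole string.
    public static string FirstWord(string text)
    {
        int offset = text.IndexOf(' ');
        return offset < 0 ? text : text.Substring(0, offset);
    }

    // Hypothetical helper: returns the text after the last space, or the whole string.
    public static string LastWord(string text)
    {
        int offset = text.LastIndexOf(' ');
        return offset < 0 ? text : text.Substring(offset + 1);
    }

    static void Main()
    {
        string original = "The quick brown fox";

        Console.WriteLine(original[4]);                             // q (offsets start at 0)
        Console.WriteLine(original.IndexOf("quick"));               // 4
        Console.WriteLine(original.LastIndexOf('o'));               // 17
        Console.WriteLine(original.IndexOfAny(new[] { 'x', 'q' })); // 4
        Console.WriteLine(original.IndexOf('z'));                   // -1
        Console.WriteLine(FirstWord(original));                     // The
        Console.WriteLine(LastWord(original));                      // fox
    }
}
```

Checking for a negative offset before calling Substring matters, because passing –1 as a starting position results in an exception.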
To perform string comparisons to determine if one string is equal to or contains another string, you can use the Compare, CompareOrdinal, CompareTo, Contains, Equals, EndsWith, and StartsWith methods.
There are 10 different overloaded versions of the static Compare
method, enabling you to control everything from case sensitivity, culture rules used to perform the comparison, starting positions of both strings being compared, and the maximum number of characters in the strings to compare.
Caution: String Comparison Rules
By default, string comparisons using any of the Compare methods are performed in a case-sensitive, culture-aware manner. Comparisons using the equality (==) operator are always performed using ordinal comparison rules.
You can also use the static CompareOrdinal
overloads (of which there are only two) if you want to compare strings based on the numeric ordinal values of each character, optionally specifying the starting positions of both strings and the maximum number of characters in the strings to compare.
The CompareTo
method compares the current string with the specified one and returns an integer value indicating whether the current string precedes, follows, or appears in the same position in the sort order as the specified string.
The Contains method searches using ordinal sorting rules, and enables you to determine if the specified string exists within the current string. If the specified string is found or is an empty string, the method returns true.
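A few of the comparison methods named earlier can be sketched side by side. The SameIgnoringCase helper is a hypothetical name; the StringComparison values and method overloads are part of the framework.

```csharp
using System;

static class ComparisonSketch
{
    // Hypothetical helper: ordinal, case-insensitive equality that tolerates null.
    public static bool SameIgnoringCase(string left, string right)
    {
        return String.Equals(left, right, StringComparison.OrdinalIgnoreCase);
    }

    static void Main()
    {
        // Ordinal comparison looks only at character codes: 'A' (65) sorts before 'a' (97).
        Console.WriteLine(String.CompareOrdinal("Apple", "apple") < 0); // True

        // The ignoreCase overload treats the two strings as equal.
        Console.WriteLine(String.Compare("Apple", "apple", true) == 0); // True

        Console.WriteLine(SameIgnoringCase("Apple", "APPLE"));          // True
        Console.WriteLine("The quick brown fox".Contains("quick"));     // True
        Console.WriteLine("The quick brown fox".Contains("Quick"));     // False (case sensitive)
        Console.WriteLine("report.txt".EndsWith(".TXT",
            StringComparison.OrdinalIgnoreCase));                       // True
    }
}
```

Passing an explicit StringComparison value, as the helper does, makes the intended comparison rules visible to anyone reading the code.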
The StartsWith and EndsWith methods (there are a total of six) determine if the beginning or ending of the current string matches a specified string. Just as with the Compare method, you can optionally indicate what culture rules should be used and if the search should be case sensitive.
Even though the string comparison methods provide ways to perform comparisons that are not case sensitive, you can also convert strings to an all-uppercase or all-lowercase representation. This is useful not only for string comparisons but also for standardizing the representation of string data.
The standard way to normalize case is to use the ToUpperInvariant
method, which creates an all-uppercase representation of the string using the casing rules of the invariant culture. To create an all-lowercase representation, it is preferred that you use the ToLowerInvariant
method, which uses the casing rules of the invariant culture. In addition to the invariant methods, you can also use the ToUpper
and ToLower
methods, which use the casing rules of either the current culture or the specified culture, depending on which overload you use.
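The following sketch contrasts invariant casing with culture-specific casing. The NormalizeKey helper is hypothetical, and the Turkish culture is used only as a well-known case where the two sets of rules can disagree; the result of the last line depends on the culture data available on the machine, so no output is shown for it.

```csharp
using System;
using System.Globalization;

static class CaseSketch
{
    // Hypothetical helper: produces a culture-independent key for lookups.
    public static string NormalizeKey(string value)
    {
        return value.ToUpperInvariant();
    }

    static void Main()
    {
        Console.WriteLine(NormalizeKey("Order-42"));   // ORDER-42
        Console.WriteLine("Order-42".ToLowerInvariant()); // order-42

        // Culture-specific casing rules can give a different result for the letter 'i'.
        CultureInfo turkish = new CultureInfo("tr-TR");
        Console.WriteLine("title".ToUpper(turkish) == "title".ToUpperInvariant());
    }
}
```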
Although performing string comparisons is common, sometimes you need to modify all or part of a string. Because strings are immutable, these methods actually return a new string rather than modifying the current one.
To remove whitespace and other characters from a string, you can use the Trim
, TrimEnd
, or TrimStart
methods. TrimEnd
and TrimStart
remove whitespace from either the end or beginning of the current string, respectively, whereas Trim
removes from both ends.
To expand, or pad, a string to be a specific length, you can use the PadLeft
or PadRight
methods. By default, these methods pad using spaces, but they both have an overload that enables you to specify the padding character to use.
The String
class also provides a set of overloaded methods enabling you to create new strings by removing or replacing characters from an existing string. The Remove
method deletes all the characters from a string starting at a specified character position and either continues through the end of the string or for a specified number of positions. The Replace
method simply replaces all occurrences of the specified character or string with another character or string by performing an ordinal search that is case sensitive but not culture sensitive.
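The modification methods can be sketched together as follows. The ToColumn helper is a hypothetical name used only to show Trim and PadLeft working in combination.

```csharp
using System;

static class ModifySketch
{
    // Hypothetical helper: trims the input and pads it on the left to a fixed width.
    public static string ToColumn(string value, int width)
    {
        return value.Trim().PadLeft(width, '.');
    }

    static void Main()
    {
        string original = "  The quick brown fox  ";
        string trimmed = original.Trim();

        Console.WriteLine("[" + trimmed + "]");               // [The quick brown fox]
        Console.WriteLine("[" + original.TrimStart() + "]");  // [The quick brown fox  ]
        Console.WriteLine(trimmed.Remove(9));                 // The quick
        Console.WriteLine(trimmed.Remove(4, 6));              // The brown fox
        Console.WriteLine(trimmed.Replace("quick", "slow"));  // The slow brown fox
        Console.WriteLine(ToColumn(" 42 ", 6));               // ....42

        // The original string is unchanged because every call returned a new string.
        Console.WriteLine(original.Length);                   // 23
    }
}
```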
You have already seen several ways to create new strings from string literals and from substrings, but you can also create new strings by combining existing strings in a process called string concatenation.
String concatenation typically occurs two different ways. The most common is to use the overloaded addition (+
) operator to combine two or more strings. You can also use one of the nine different overloads of the Concat
method, which enables you to concatenate an unlimited number of strings. Both of these methods are shown in Listing 9.5.
Tip: String Concatenation Using Addition
The addition operator actually just calls the appropriate overload of the Concat
method.
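// The operands are joined exactly as written, so the space comes from the first literal.
string string1 = "this is " + "basic concatenation.";
// Concat adds no separators: the result is "thisismoreadvancedconcatenation".
string string2 = String.Concat("this", "is", "more", "advanced", "concatenation");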
Closely related to concatenation is the idea of joining strings, which uses the Join
method. Unlike the Concat
method, Join
concatenates a specified separator between each element of a given set of strings.
If joining a string combines a set of strings, the opposite is splitting a string into an undetermined number of substrings based on a delimiting character, accomplished by using the Split
method, which accepts either a set of characters or a set of strings for the delimiting set.
Listing 9.6 shows an example of joining a string and then splitting it based on the same delimiting character. First, an array of 10 strings is created and then joined using #
as the separator character. The resulting string is then split on the same separator character and each word is printed on a separate line.
string[] strings = new string[10];
for (int i = 0; i < 10; i++)
{
strings[i] = String.Format("{0}", i * 2);
}
string joined = String.Join("#", strings);
Console.WriteLine(joined);
foreach (string word in joined.Split(new char[] { '#' }))
{
Console.WriteLine(word);
}
Because strings are immutable, every time you perform any string manipulation, you create new temporary strings. To allow mutable strings to be created and manipulated without creating new strings, C# provides the StringBuilder
class. Using string concatenation is preferred if a fixed number of strings will be concatenated. If an arbitrary number of strings will be concatenated, such as inside an iteration statement, a StringBuilder
is preferred.
StringBuilder
supports appending data to the end of the current string, inserting data at a specified position, replacing data, and removing characters from the current string. Appending data uses one of the overloads of either the Append
or AppendFormat
methods.
The Append
method adds text or a string representation of an object to the end of the current string. The AppendFormat
method supports adding text to the end of the current string by using composite formatting. Because AppendFormat
uses composite formatting, you can pass it a format string, which you learn about in the next section.
Go To
We discuss composite formatting a bit later in this hour.
Listing 9.7 shows the same example of joining and splitting strings shown in Listing 9.6 but uses a StringBuilder
rather than a string
array.
StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < 10; i++)
{
stringBuilder.AppendFormat("{0}#", i * 2);
}
// Remove the trailing '#' character.
stringBuilder.Remove(stringBuilder.Length - 1, 1);
// The builder already contains the separators, so no call to String.Join is needed.
string joined = stringBuilder.ToString();
Console.WriteLine(joined);
foreach (string word in joined.Split(new char[] { '#' }))
{
Console.WriteLine(word);
}
To insert data, one of the Insert
overloads should be used. When you insert data, you must provide the position within the current StringBuilder
string where the insertion will begin. To remove data, you use the Remove
method and indicate the starting position where the removal begins and the number of characters to remove. The Replace
method, or one of its overloads, can be used to replace characters within the current StringBuilder
string with another specified character. The Replace
method also supports replacing characters within a substring of the current StringBuilder
string, specified by a starting position and length.
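As a short sketch of these editing methods, the following program applies an insertion, a removal, and two replacements to a single StringBuilder. The Edit method is a hypothetical wrapper for the example.

```csharp
using System;
using System.Text;

static class BuilderEditSketch
{
    // Hypothetical helper: applies several edits to one buffer and returns the result.
    public static string Edit()
    {
        StringBuilder builder = new StringBuilder("The quick fox");

        builder.Insert(10, "brown ");            // The quick brown fox
        builder.Remove(0, 4);                    // quick brown fox
        builder.Replace('o', '0');               // quick br0wn f0x
        builder.Replace("quick", "slow", 0, 5);  // searches only the first five characters

        return builder.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Edit());               // slow br0wn f0x
    }
}
```

Each call changes the same buffer and returns the builder itself, so no intermediate strings are created until ToString is called.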
Internally, the StringBuilder
string is maintained in a buffer to accommodate concatenation. The StringBuilder
needs to allocate additional memory only if the buffer does not have enough room to accommodate the new data.
The default size, or capacity, for this internal buffer is 16 characters. When the buffer reaches capacity, additional buffer space is allocated, with the amount based on the current capacity. StringBuilder also has a maximum capacity, which is Int32.MaxValue, or 2^31 - 1 (2,147,483,647), characters.
The length of the current string can be set using the Length property. Setting Length to a value that is larger than the current capacity automatically increases the capacity. Similarly, setting Length to a value that is less than the current length shortens the current string.
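The relationship between Length and Capacity can be observed with a small sketch. The Truncate helper is hypothetical, and the capacity value printed first is left without an expected result because it is a detail of the runtime implementation.

```csharp
using System;
using System.Text;

static class CapacitySketch
{
    // Hypothetical helper: shortens the builder's content by setting Length.
    public static string Truncate(string text, int length)
    {
        StringBuilder builder = new StringBuilder(text);
        builder.Length = length;
        return builder.ToString();
    }

    static void Main()
    {
        StringBuilder builder = new StringBuilder();
        Console.WriteLine(builder.Capacity);   // initial capacity of the buffer

        builder.Append('x', 100);              // appends 100 'x' characters
        Console.WriteLine(builder.Length);     // 100
        Console.WriteLine(builder.Capacity >= builder.Length); // True

        Console.WriteLine(Truncate("The quick brown fox", 9)); // The quick
    }
}
```

If you know roughly how large the final string will be, passing an initial capacity to the StringBuilder constructor can avoid some of the intermediate allocations.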
Formatting allows you to convert an instance of a class, structure, or enumeration value to a string representation. Every type that derives from System.Object
automatically inherits a parameterless ToString
method, which, by default, returns the name of the type. All the predefined value types have overridden ToString
to return a general format for the type.
Because getting the name of the type from ToString
isn’t generally useful, you can override the ToString
method and provide a meaningful string representation of your type. Listing 9.8 shows the Contact
class overriding the ToString
method.
class Contact
{
private string firstName;
private string lastName;
public override string ToString()
{
return firstName + " " + lastName;
}
}
Before you start adding ToString
overrides to all your classes, be aware that the Visual Studio debugging tools use ToString
extensively to determine what values to display for an object when viewed through a debugger.
The value of an object often has multiple representations, and ToString
enables you to pass a format string as a parameter that determines how the string representation should appear. A format string contains one or more format specifiers that define how the string representation should appear.
A standard format string contains a single format specifier, which is a single character that defines a more complete format string, and an optional precision specifier that affects the number of digits displayed in the result. If supported, the precision specifier can be any integer value from 0 to 99. All the numeric types, date and time types, and enumeration types support a set of predefined standard format strings, including a "G"
standard format specifier, which represents a general string representation of the value.
The standard format specifiers are shown in Table 9.2.
Using the standard format strings to format a Days
enumeration value is shown in Listing 9.9.
Days days = Days.Monday;
string[] formats = { "G", "F", "D", "X" };
foreach (string format in formats)
{
Console.WriteLine(days.ToString(format));
}
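The same approach works for numeric types. The following sketch applies several standard format strings, with and without a precision specifier; it passes CultureInfo.InvariantCulture so that the separators do not depend on the machine's settings. The FormatInvariant helper is a hypothetical name, while IFormattable is the framework interface implemented by the numeric types.

```csharp
using System;
using System.Globalization;

static class StandardFormatSketch
{
    // Hypothetical helper: formats with the invariant culture so output is predictable.
    public static string FormatInvariant(IFormattable value, string format)
    {
        return value.ToString(format, CultureInfo.InvariantCulture);
    }

    static void Main()
    {
        Console.WriteLine(FormatInvariant(1234.5, "N2"));  // 1,234.50
        Console.WriteLine(FormatInvariant(1234.5, "F1"));  // 1234.5
        Console.WriteLine(FormatInvariant(42, "D5"));      // 00042
        Console.WriteLine(FormatInvariant(255, "X"));      // FF
        Console.WriteLine(FormatInvariant(255, "x4"));     // 00ff
    }
}
```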
Just as you can override the ToString
method, you can define standard format specifiers for your own classes as well by defining a ToString(string)
method, which should support the following:
• A "G"
format specifier that represents a common format. Your override of the parameterless ToString
method should simply call ToString(string)
and pass it the "G"
standard format string.
• A format specifier that is equal to a null reference that should be considered equivalent to the "G"
format specifier.
Listing 9.10 shows an updated Celsius struct from Hour 6, “Creating Enumerated Types and Structures,” that supports format specifiers to represent the value in degrees Fahrenheit and kelvins.
struct Celsius
{
public float Degrees;
public Celsius(float temperature)
{
this.Degrees = temperature;
}
public override string ToString()
{
return this.ToString("C");
}
public string ToString(string format)
{
if (String.IsNullOrWhiteSpace(format))
{
format = "C";
}
format = format.ToUpperInvariant().Trim();
switch(format)
{
case "C":
return this.Degrees.ToString("N2") + " °C";
case "F":
return (this.Degrees * 9 / 5 + 32).ToString("N2") + " °F";
case "K":
return (this.Degrees + 273.15f).ToString("N2") + " K";
default:
throw new FormatException();
}
}
}
Custom format strings consist of one or more custom format specifiers that define the string representation of a value. If a format string contains a single custom format specifier, it should be preceded by the percent (%
) symbol so that it won’t be confused with a standard format specifier.
All the numeric types and the date and time types support custom format strings. Many of the standard date and time format strings are aliases for custom format strings. Using custom format strings also provides a great deal of flexibility by enabling you to define your own formats by combining multiple custom format specifiers.
The custom format specifiers are described in Table 9.3.
Listing 9.11 displays a DateTime
instance using two different custom format strings.
DateTime date = new DateTime(2013, 3, 22);
// Displays 3
Console.WriteLine(date.ToString("%M"));
// Displays Friday March 22, 2013
Console.WriteLine(date.ToString("dddd MMMM dd, yyyy"));
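Custom format specifiers can be combined for numeric values as well as dates. This sketch again assumes the invariant culture so that day names, month names, and separators are predictable; the ToSortableDate helper is a hypothetical name.

```csharp
using System;
using System.Globalization;

static class CustomFormatSketch
{
    // Hypothetical helper: formats a date as year-month-day regardless of the current culture.
    public static string ToSortableDate(DateTime date)
    {
        return date.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
    }

    static void Main()
    {
        CultureInfo invariant = CultureInfo.InvariantCulture;
        DateTime date = new DateTime(2013, 3, 22);
        double amount = 1234.5;
        int small = 7;
        double ratio = 0.256;

        Console.WriteLine(ToSortableDate(date));                    // 2013-03-22
        Console.WriteLine(date.ToString("dddd, MMM d", invariant)); // Friday, Mar 22
        Console.WriteLine(amount.ToString("#,##0.00", invariant));  // 1,234.50
        Console.WriteLine(small.ToString("000", invariant));        // 007
        Console.WriteLine(ratio.ToString("0.0%", invariant));       // 25.6%
    }
}
```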
You have already seen composite formatting in some of the previous examples using Console.WriteLine
and StringBuilder.AppendFormat
. Methods that use composite formatting accept a composite format string and a list of objects as parameters. A composite format string defines a template consisting of fixed text and indexed placeholders, called format items, which correspond to the objects in the list. Composite formatting does not allow you to specify more format items than there are objects in the list, although you can include more objects in the list than there are format items.
The syntax for a format item is as follows:
{index[,alignment][:formatString]}
The matching curly braces and index
are required.
The index corresponds to the position of the object it represents in the method’s parameter list. Indexes are zero-based, multiple format items can use the same index, and format items can refer to any object in the list, in any order.
The optional alignment
component indicates the preferred field width. A positive value produces a right-aligned field, whereas a negative value produces a left-aligned field. If the value is less than the length of the formatted string, the alignment
component is ignored.
The formatString
component uses either the standard or custom format strings you just learned. If the formatString
is not specified, the general format specifier "G"
is used instead.
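The three components of a format item can be seen together in a short sketch. The Row helper is a hypothetical name; it uses the String.Format overload that accepts a format provider so that the decimal separator does not vary by machine.

```csharp
using System;
using System.Globalization;

static class CompositeSketch
{
    // Hypothetical helper: builds one row of a fixed-width report.
    public static string Row(string name, double value)
    {
        // {0,-6} is left-aligned in six columns; {1,6:F1} is right-aligned with one decimal.
        return String.Format(CultureInfo.InvariantCulture,
            "|{0,-6}|{1,6:F1}|", name, value);
    }

    static void Main()
    {
        Console.WriteLine(Row("ab", 3.14159));    // |ab    |   3.1|

        // The alignment is ignored when the text is longer than the field width.
        Console.WriteLine(Row("abcdefgh", 2.5));  // |abcdefgh|   2.5|

        // The same index can be used more than once, in any order.
        Console.WriteLine(String.Format("{1} {0} {1}", "a", "b")); // b a b
    }
}
```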
In Listing 9.12, the first format item, {0:d}, is replaced by the string representation of DateTime.Today, and the second format item, {1}, is replaced by the string representation of temp.
Celsius temp = new Celsius(28);
// Using composite formatting with String.Format.
string result = String.Format("On {0:d}, the high temperature was {1}.",
DateTime.Today, temp);
Console.WriteLine(result);
// Using composite formatting with Console.WriteLine.
Console.WriteLine("On {0:dddd MMM dd, yyyy}, the high temperature was {1}.",
DateTime.Today, temp);
Often referred to as patterns, a regular expression describes a set of strings. A regular expression is applied to a string to find out if the string matches the provided pattern, to return a substring or a collection of substrings, or to return a new string that represents a modification of the original.
Tip: Regular Expression Compatibility
Regular expressions in the .NET Framework are designed to be compatible with Perl 5 regular expressions, incorporating the most popular features of other regular expression implementations, such as Perl and awk, and including features not yet seen in other implementations.
Regular expressions are a programming language in their own right and are designed and optimized for text manipulation by using both literal text characters and metacharacters. A literal character is one that should be matched in the target string, whereas metacharacters inform the regular expression parser, which is responsible for interpreting the regular expression and applying it to the target string, how to behave, so you can think of them as commands. These metacharacters give regular expressions their flexibility and processing power. The common metacharacters used in regular expressions are described in Table 9.4.
Regular expressions are implemented in the .NET Framework by several classes in the System.Text.RegularExpressions namespace that provide support for parsing and applying regular expression patterns and working with capturing groups.
The Regex
class provides the implementation of the regular expression parser and the engine that applies that pattern to an input string. Using this class, you can quickly parse large amounts of text for specific patterns and easily extract and edit substrings.
The Regex
class provides both instance and static members, allowing it to be used two different ways. When you create specific instances of the Regex
class, the expression patterns are not compiled and cached. However, by using the static methods, the expression pattern is compiled and cached. The regular expression engine caches the 15 most recently used static regular expressions by default. You might prefer to use the static methods rather than the equivalent instance methods if you extensively use a fixed set of regular expressions.
When a regular expression is applied to a string using the Match method of the Regex class, the first successful match found is represented by an instance of the Match class. The MatchCollection class contains the set of Match objects found by repeatedly applying the regular expression until the first unsuccessful match occurs.
The Match.Groups
property represents the collection of captured groups in a single match. Each group is represented by the Group
class, which contains a collection of Capture
objects returned by the Captures
property. A Capture
represents the results from a single subexpression match.
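Capturing groups can be sketched with a Zip+4 style pattern in which each part of the code is placed in its own group. The pattern and the Extension helper are assumptions made for this illustration; group 0 always represents the entire match.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupSketch
{
    // Illustrative pattern: five digits, optionally followed by a hyphen and four digits.
    const string Pattern = @"^(\d{5})(?:-(\d{4}))?$";

    // Hypothetical helper: returns the four-digit extension, or an empty string.
    public static string Extension(string zip)
    {
        Match match = Regex.Match(zip, Pattern);
        return match.Success ? match.Groups[2].Value : String.Empty;
    }

    static void Main()
    {
        Match match = Regex.Match("00364-3276", Pattern);

        Console.WriteLine(match.Success);             // True
        Console.WriteLine(match.Groups[0].Value);     // 00364-3276 (the whole match)
        Console.WriteLine(match.Groups[1].Value);     // 00364
        Console.WriteLine(match.Groups[2].Value);     // 3276
        Console.WriteLine(Extension("90210").Length); // 0 (the optional group did not match)
    }
}
```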
One of the most common uses of regular expressions is to validate a string by testing if it conforms to a particular pattern. To accomplish this, you can use one of the overloads for the IsMatch
method. Listing 9.13 shows using a regular expression to validate United States Zip+4 postal codes.
string pattern = @"^\d{5}(-\d{4})?$";
Regex expression = new Regex(pattern);
Console.WriteLine(expression.IsMatch("90210")); // true
Console.WriteLine(expression.IsMatch("00364-3276")); // true
Console.WriteLine(expression.IsMatch("3361")); // false
Console.WriteLine(expression.IsMatch("0036-43275")); // false
Console.WriteLine(expression.IsMatch("90210-")); // false
Regular expressions can also be used to search for substrings that match a particular regular expression pattern. This searching can be performed once, in which case the first occurrence is returned, or it can be performed repeatedly, in which case a collection of occurrences is returned.
Searching for substrings in this manner uses the Match
method to find the first occurrence matching the pattern or the Matches
method to return a sequence of successful nonoverlapping matches.
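The difference between the two methods can be sketched by searching a sentence for runs of digits. The input text and the Numbers helper are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class MatchesSketch
{
    // Hypothetical helper: collects every run of digits in the input.
    public static List<string> Numbers(string input)
    {
        List<string> results = new List<string>();
        foreach (Match match in Regex.Matches(input, @"\d+"))
        {
            results.Add(match.Value);
        }
        return results;
    }

    static void Main()
    {
        string input = "Order 66 shipped 3 items on day 14.";

        // Match returns only the first occurrence.
        Match first = Regex.Match(input, @"\d+");
        Console.WriteLine("{0} at offset {1}", first.Value, first.Index); // 66 at offset 6

        // Matches returns every nonoverlapping occurrence.
        Console.WriteLine(String.Join(", ", Numbers(input)));             // 66, 3, 14
    }
}
```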
Continuing to move further away from the foundational aspects of programming, in this hour, you learned how C# enables you to work with string data. This included how to perform string comparisons, how to create mutable strings, and how to use regular expressions.
Q. Are strings immutable?
A. Yes, strings in C# are immutable, and any operation that modifies the content of a string actually creates a new string with the changed value.
Q. What does the @ symbol mean when it precedes a string literal?
A. The @
symbol is the verbatim string symbol and causes the C# compiler to treat the string exactly as it is written, even if it spans multiple lines or includes special characters.
Q. What are the common string manipulation functions supported by C#?
A. C# supports the following common string manipulations:
• Determining the string length
• Trimming and padding strings
• Creating substrings, concatenating strings, and splitting strings based on specific characters
• Removing and replacing characters
• Performing string containment and comparison operations
• Converting string case
Q. What is the benefit of the StringBuilder
class?
A. The StringBuilder
class enables you to create and manipulate a string that is mutable and is most often used for string concatenation inside a loop.
Q. What are regular expressions?
A. A regular expression is a pattern that describes a set of strings. Regular expressions are optimized for text manipulation.
1. What is string interning and why is it used?
2. Using a verbatim string literal, must an embedded double-quote character be escaped?
3. What is the recommended way to test for an empty string?
4. What is the difference between the IndexOf
and IndexOfAny
methods?
5. Do any of the string manipulation functions result in a new string being created?
6. What will the output of the following statement be and why?
int i = 10;
Console.WriteLine(i.ToString("P"));
7. What will the output of the following statement be and why?
DateTime today = new DateTime(2009, 8, 23);
Console.WriteLine(today.ToString("MMMM"));
8. What will the output of the following Console.WriteLine
statements be?
Console.WriteLine("¦{0}¦", 10);
Console.WriteLine("¦{0, 3}¦", 10);
Console.WriteLine("¦{0:d4}¦", 10);
int a = 24;
int b = -24;
Console.WriteLine(a.ToString("##;(##)"));
Console.WriteLine(b.ToString("##;(##)"));
9. What is the benefit of using a StringBuilder
for string concatenation inside a loop?
10. What does the following regular expression pattern mean?
[\w-]+@([\w-]+\.)+[\w-]+
1. String interning is used by the C# compiler to eliminate duplicate string literals to save space at runtime.
2. The double-quote character is the only character that must be escaped using a verbatim string literal so that the compiler can determine where the string ends.
3. The recommended way to test for an empty string is to use the static String.IsNullOrEmpty
method.
4. The IndexOf
method reports the index of the first occurrence found of a single character or string, whereas the IndexOfAny
method reports the first occurrence found of any character in a set of characters.
5. Yes, because strings are immutable, all the string manipulation functions result in a new string being created.
6. The output will be “1,000.00 %” because the "P"
numeric format specifier results in the number being multiplied by 100 and displayed with a percent symbol.
7. The "MMMM"
custom date and time format specifier represents the full name of the month, so the output will be “August”.
8. The output will be as follows:
¦10¦
¦ 10¦
¦0010¦
24
(24)
9. Because the StringBuilder
represents a mutable string, using it for string concatenation inside a loop prevents multiple temporary strings from being created during each iteration to perform the concatenation.
10. This is a simple regular expression for parsing a string as an email address. Broken down, it means “Match any word character or hyphen one or more times, followed by the @ character, followed by a group containing any word character or hyphen one or more times followed by a period (.) character, where that group is repeated one or more times, followed by any word character or hyphen one or more times.”
1. Create a new console application and implement the Celsius
struct shown in Listing 9.10. In the Main
method, implement the code shown in Listing 9.12.
2. Create a new console application and in the Main
method implement the code shown in Listing 9.13. Then implement a Validate
method that is called from the Main
method. The Validate
method should use the static Regex.IsMatch
method to validate a string parameter as a phone number. The necessary regular expression pattern to match a phone number in the form of 555-555-5555 should look like:
^[2-9]\d{2}-\d{3}-\d{4}$