The System.Text namespace contains classes that convert text from one character encoding to another. The input is generally assumed to be Unicode; the output depends on the encoding class you select. Listing B.59 shows an example that uses the UTF8Encoding class (encoding.cs).
byte[] encoded;
UTF8Encoding encoding = new UTF8Encoding();
Console.WriteLine("CodePage: {0} ", encoding.CodePage);
Console.WriteLine("EncodingName: {0} ", encoding.EncodingName);
Console.WriteLine("WindowsCodePage: {0} ", encoding.WindowsCodePage);
encoded = encoding.GetBytes(japanese);
Console.WriteLine("Encoded Japanese: {0} <-> {1} ", japanese.Length, encoded.Length);
encoded = encoding.GetBytes(chinese);
Console.WriteLine("Encoded Chinese: {0} <-> {1} ", chinese.Length, encoded.Length);
encoded = encoding.GetBytes(english);
Console.WriteLine("Encoded English: {0} <-> {1} ", english.Length, encoded.Length);
This sample uses three strings: one Japanese, one Chinese, and one English, each stored as a .NET string. The UTF-8 encoder takes each string and outputs the sequence of bytes that forms its UTF-8 representation. Listing B.60 shows the output of Listing B.59.
CodePage: 65001
EncodingName: Unicode (UTF-8)
WindowsCodePage: 1200
Encoded Japanese: 7 <-> 21
Encoded Chinese: 4 <-> 12
Encoded English: 12 <-> 12
What was 7 Japanese characters (14 bytes as a .NET string) turned into 21 UTF-8 bytes. What was 4 Chinese characters (8 bytes) turned into 12 UTF-8 bytes. And what was 12 English characters (24 bytes) turned into 12 UTF-8 bytes. Clearly, for Japanese and Chinese, you cannot assume that two bytes (16 bits) are enough to represent a character in UTF-8; in this sample each character takes three. For English, the encoding halved the number of bytes required.
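The size comparison can be checked without allocating the encoded array by calling GetByteCount. The following is a minimal sketch, not part of encoding.cs; the class name Utf8SizeDemo, its helper methods, and the sample strings are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Text;

public static class Utf8SizeDemo
{
    // Number of bytes needed to store the text as UTF-8.
    public static int Utf8Bytes(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // Number of bytes needed to store the text as UTF-16,
    // the encoding a .NET string uses in memory.
    public static int Utf16Bytes(string text)
    {
        return Encoding.Unicode.GetByteCount(text);
    }

    // Encodes to UTF-8 and decodes again; true if nothing was lost.
    public static bool RoundTrips(string text)
    {
        byte[] encoded = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(encoded) == text;
    }

    public static void Main()
    {
        // Hypothetical samples: five hiragana characters and 12 ASCII characters.
        string japanese = "\u3053\u3093\u306B\u3061\u306F";
        string english = "Hello, world";

        foreach (string text in new[] { japanese, english })
        {
            Console.WriteLine("{0} chars: UTF-16 {1} bytes, UTF-8 {2} bytes, round trip ok: {3}",
                text.Length, Utf16Bytes(text), Utf8Bytes(text), RoundTrips(text));
        }
    }
}
```

Each of the five hiragana characters lies between U+0800 and U+FFFF, so each needs three bytes in UTF-8 (15 in total, against 10 in UTF-16), while the ASCII string needs one byte per character.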
The Regex class (and its associated support classes) in the System.Text.RegularExpressions namespace makes traditional regular expression processing considerably easier to use and more capable. Listing B.61 shows how to use a single regular expression to split a file path into its components (regex.cs).
Listing B.62 shows the output from the code in Listing B.61.
Success in parsing "c:\a\b\c\d\e.cs" !!
Drive: c:
Directories: 4
a starts at character 3
b starts at character 5
c starts at character 7
d starts at character 9
File: e.cs
Base: e
Extension: cs
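One way a single expression can produce output of this shape is to use named groups and read every capture of a repeated group. The sketch below is an assumption about how such a pattern might look, not the code from regex.cs; the class name PathSplitDemo, the pattern, and the group names drive, dir, base, and ext are all hypothetical. It handles only simple paths with a drive letter and a single dot in the file name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PathSplitDemo
{
    // Hypothetical pattern: drive, zero or more "dir\" segments,
    // then a base name and an extension separated by one dot.
    private static readonly Regex PathPattern = new Regex(
        @"^(?<drive>[A-Za-z]:)\\(?:(?<dir>[^\\]+)\\)*(?<base>[^\\.]+)\.(?<ext>[^\\.]+)$");

    public static Match Parse(string path)
    {
        return PathPattern.Match(path);
    }

    public static void Main()
    {
        string path = @"c:\a\b\c\d\e.cs";
        Match m = Parse(path);
        if (!m.Success)
        {
            Console.WriteLine("Failed to parse \"{0}\"", path);
            return;
        }

        Console.WriteLine("Success in parsing \"{0}\" !!", path);
        Console.WriteLine("Drive: {0}", m.Groups["drive"].Value);

        // A group inside a repetition keeps every capture, not only the last.
        CaptureCollection dirs = m.Groups["dir"].Captures;
        Console.WriteLine("Directories: {0}", dirs.Count);
        foreach (Capture dir in dirs)
        {
            Console.WriteLine("{0} starts at character {1}", dir.Value, dir.Index);
        }

        string baseName = m.Groups["base"].Value;
        string extension = m.Groups["ext"].Value;
        Console.WriteLine("File: {0}.{1}", baseName, extension);
        Console.WriteLine("Base: {0}", baseName);
        Console.WriteLine("Extension: {0}", extension);
    }
}
```

The Captures collection is what makes the single-expression approach work: Groups["dir"].Value alone would return only the last directory, whereas each Capture also carries the zero-based Index at which it was found in the input.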