Encoding text

Text characters can be represented in different ways. For example, the alphabet can be encoded using Morse code into a series of dots and dashes for transmission over a telegraph line.

In a similar way, text inside a computer is stored as bits (ones and zeros). .NET Core uses a standard called Unicode to represent text internally; strings are stored as sequences of UTF-16 code units. Sometimes, you will need to move text outside .NET Core for use by systems that do not use Unicode or that use a different encoding of Unicode.

The following table lists some alternative text encodings commonly used by computers:

| Encoding | Description |
|---|---|
| ASCII | This encodes a limited range of characters using the lower seven bits of a byte |
| UTF-8 | This represents each Unicode code point as a sequence of one to four bytes |
| UTF-16 | This represents each Unicode code point as a sequence of one or two 16-bit integers |
| ANSI/ISO encodings | This provides support for a variety of code pages that are used to support a specific language or group of languages |
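To make the variable-length descriptions in the table concrete, the following sketch counts the bytes that UTF-8 and UTF-16 need for a single code point. The class name `EncodingSizes`, the helper `ByteCount`, and the four sample characters are chosen here for illustration and are not part of the chapter's project:

```csharp
using System;
using System.Text;

public static class EncodingSizes
{
    // Returns the number of bytes needed to store text in the given encoding
    // (byte order marks are not included in the count).
    public static int ByteCount(Encoding encoding, string text) =>
        encoding.GetByteCount(text);

    public static void Main()
    {
        // Illustrative samples: A, pound sign, euro sign, and an emoji
        // that lies outside the Basic Multilingual Plane.
        string[] samples = { "A", "\u00A3", "\u20AC", "\U0001F600" };

        foreach (string s in samples)
        {
            Console.WriteLine(
                $"U+{char.ConvertToUtf32(s, 0):X4}: " +
                $"UTF-8 bytes: {ByteCount(Encoding.UTF8, s)}, " +
                $"UTF-16 bytes: {ByteCount(Encoding.Unicode, s)}");
        }
        // U+0041: UTF-8 bytes: 1, UTF-16 bytes: 2
        // U+00A3: UTF-8 bytes: 2, UTF-16 bytes: 2
        // U+20AC: UTF-8 bytes: 3, UTF-16 bytes: 2
        // U+1F600: UTF-8 bytes: 4, UTF-16 bytes: 4
    }
}
```

The last sample needs two 16-bit integers (a surrogate pair) in UTF-16, which is the "one or two" case described in the table.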

Encoding strings as byte arrays

Add a new console application project named Ch10_Encoding.

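Import the System.Text namespace, statically import the System.Console class (the code calls WriteLine and ReadKey without a Console prefix), and add the following statements to the Main method. The code encodes a string using the chosen encoding, loops through each byte, and then decodes the bytes back into a string and outputs it: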

    WriteLine("Encodings"); 
    WriteLine("[1] ASCII"); 
    WriteLine("[2] UTF-7"); 
    WriteLine("[3] UTF-8"); 
    WriteLine("[4] UTF-16 (Unicode)"); 
    WriteLine("[5] UTF-32"); 
    WriteLine("[any other key] Default"); 
 
    // choose an encoding 
    Write("Press a number to choose an encoding: "); 
    ConsoleKey number = ReadKey(false).Key; 
    WriteLine(); 
    WriteLine(); 
 
    Encoding encoder; 
    switch (number) 
    { 
      case ConsoleKey.D1: 
        encoder = Encoding.ASCII; 
        break; 
      case ConsoleKey.D2: 
        encoder = Encoding.UTF7; 
        break; 
      case ConsoleKey.D3: 
        encoder = Encoding.UTF8; 
        break; 
      case ConsoleKey.D4: 
        encoder = Encoding.Unicode; 
        break; 
      case ConsoleKey.D5: 
        encoder = Encoding.UTF32; 
        break; 
      default: 
        encoder = Encoding.GetEncoding(0); 
        break; 
    } 
 
    // define a string to encode 
    string message = "A pint of milk is £1.99"; 
 
    // encode the string into a byte array 
    byte[] encoded = encoder.GetBytes(message); 
 
    // check how many bytes the encoding needed 
    WriteLine($"{encoder.GetType().Name} uses {encoded.Length} bytes.");
 
    // enumerate each byte 
    WriteLine("Byte  Hex  Char");
    foreach (byte b in encoded) 
    { 
      WriteLine($"{b,4} {b.ToString("X"),4} {(char)b,5}"); 
    } 
 
    // decode the byte array back into a string and display it 
    string decoded = encoder.GetString(encoded); 
    WriteLine(decoded); 

Run the application and press 1 to choose ASCII. Notice that the pound sign (£) cannot be represented in ASCII, so the encoder substitutes a question mark (?) for it:

Encodings
[1] ASCII
[2] UTF-7
[3] UTF-8
[4] UTF-16 (Unicode)
[5] UTF-32
[any other key] Default
Press a number to choose an encoding: 1
ASCIIEncoding uses 23 bytes.
Byte  Hex  Char
  65   41     A
  32   20
 112   70     p
 105   69     i
 110   6E     n
 116   74     t
  32   20
 111   6F     o
 102   66     f
  32   20
 109   6D     m
 105   69     i
 108   6C     l
 107   6B     k
  32   20
 105   69     i
 115   73     s
  32   20
  63   3F     ?
  49   31     1
  46   2E     .
  57   39     9
  57   39     9
A pint of milk is ?1.99
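The silent substitution of ? can hide data loss. As a minimal sketch, assuming you would rather detect the problem than accept the replacement, you can request an ASCII encoding with an exception fallback. The class name `StrictAscii` and the helper `CanEncodeAsAscii` are hypothetical names used only for this illustration:

```csharp
using System;
using System.Text;

public static class StrictAscii
{
    // Returns true if every character in text can be encoded as ASCII.
    public static bool CanEncodeAsAscii(string text)
    {
        // Ask for an ASCII encoding that throws instead of substituting '?'.
        Encoding strict = Encoding.GetEncoding(
            "us-ascii",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            // At least one character has no ASCII representation.
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(CanEncodeAsAscii("A pint of milk is 1.99"));       // True
        Console.WriteLine(CanEncodeAsAscii("A pint of milk is \u00A31.99")); // False
    }
}
```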

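Rerun the application and press 3 to choose UTF-8. Notice that UTF-8 requires one extra byte (24 bytes instead of 23) because it stores the £ as two bytes, and that the decoded string preserves the £. The Char column casts each byte to a character on its own, which is why the two bytes of the £ are displayed as Â and £: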

UTF8Encoding uses 24 bytes.
Byte  Hex  Char
  65   41     A
  32   20
 112   70     p
 105   69     i
 110   6E     n
 116   74     t
  32   20
 111   6F     o
 102   66     f
  32   20
 109   6D     m
 105   69     i
 108   6C     l
 107   6B     k
  32   20
 105   69     i
 115   73     s
  32   20
 194   C2     Â
 163   A3     £
  49   31     1
  46   2E     .
  57   39     9
  57   39     9
A pint of milk is £1.99

Rerun the application and press 4 to choose Unicode (UTF-16). Notice that UTF-16 requires two bytes for every character in this message (46 bytes for 23 characters), but it can store the £:

UnicodeEncoding uses 46 bytes.

Encoding and decoding text in files

When using stream helper classes, such as StreamReader and StreamWriter, you can specify the encoding you want to use. As you write to the helper, the strings will automatically be encoded, and as you read from the helper, the bytes will be automatically decoded. This is how you can specify the encoding:

    var reader = new StreamReader(stream, Encoding.UTF7); 
    var writer = new StreamWriter(stream, Encoding.UTF7); 
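The following sketch shows the helpers doing the encoding and decoding for you. It writes a string to a stream and reads it back with the same encoding. It makes two assumptions that are not in the text above: it uses a MemoryStream in place of a file, and it swaps in UTF-8 and ASCII rather than UTF-7 (Encoding.UTF7 is marked obsolete in newer .NET versions). The names `StreamRoundTrip` and `RoundTrip` are hypothetical:

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamRoundTrip
{
    // Writes text to an in-memory stream using the given encoding,
    // then reads it back using the same encoding.
    public static string RoundTrip(string text, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        {
            // leaveOpen: true keeps the MemoryStream usable after the writer is disposed.
            using (var writer = new StreamWriter(stream, encoding, 1024, leaveOpen: true))
            {
                writer.Write(text); // the string is encoded as it is written
            } // disposing the writer flushes the encoded bytes to the stream

            stream.Position = 0; // rewind before reading

            using (var reader = new StreamReader(stream, encoding))
            {
                return reader.ReadToEnd(); // the bytes are decoded as they are read
            }
        }
    }

    public static void Main()
    {
        string message = "A pint of milk is \u00A31.99";
        Console.WriteLine(RoundTrip(message, Encoding.UTF8) == message);  // True
        Console.WriteLine(RoundTrip(message, Encoding.ASCII) == message); // False
    }
}
```

The ASCII round trip does not match because the £ was replaced by ? when the writer encoded it.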

Tip

Good Practice

Often, you won't have a choice of encoding, because you will be generating a file for use by another system. However, if you do have a choice, pick the encoding that uses the fewest bytes while still being able to store every character you need.
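One way to apply this advice is to try each candidate encoding, discard any that cannot round-trip the text, and keep the smallest of the rest. This is only a sketch under that assumption; the class `SmallestEncoding`, the method `Pick`, and the candidate list are hypothetical:

```csharp
using System;
using System.Text;

public static class SmallestEncoding
{
    // Returns the candidate that stores text losslessly in the fewest bytes,
    // or null if no candidate can round-trip the text.
    // When two candidates tie, the one listed first wins.
    public static Encoding Pick(string text, params Encoding[] candidates)
    {
        Encoding best = null;
        int bestSize = int.MaxValue;

        foreach (Encoding candidate in candidates)
        {
            byte[] bytes = candidate.GetBytes(text);

            // Skip encodings that lose characters (for example, ASCII turning £ into ?).
            if (candidate.GetString(bytes) != text) continue;

            if (bytes.Length < bestSize)
            {
                best = candidate;
                bestSize = bytes.Length;
            }
        }
        return best;
    }

    public static void Main()
    {
        Encoding[] candidates =
            { Encoding.ASCII, Encoding.UTF8, Encoding.Unicode, Encoding.UTF32 };

        Console.WriteLine(Pick("A pint of milk is 1.99", candidates).WebName);       // us-ascii
        Console.WriteLine(Pick("A pint of milk is \u00A31.99", candidates).WebName); // utf-8
    }
}
```

The check compares a sample of text only, so it is no substitute for knowing which characters the receiving system must handle.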
