© Ray Lischner 2020
R. Lischner, Exploring C++20, https://doi.org/10.1007/978-1-4842-5961-0_59

59. International Characters

Ray Lischner, Ellicott City, MD, USA

Explorations 17–19 discussed characters, but only hinted at bigger things to come. Exploration 58 started to examine these bigger issues with locales and facets. The next topic to explore is how to wrangle character sets and character encodings in an international setting.

This Exploration introduces wide characters, which are like ordinary (or narrow) characters, except that they usually occupy more memory. This means the wide character type can potentially represent many more characters than plain char. During your exploration of wide characters, you will also get to know more about Unicode.

Why Wide?

As you saw in Exploration 18, the meaning of a particular character value depends on the locale and character set. For instance, in one locale, you can handle Greek characters, while in another locale, Cyrillic, depending on the character set. Your program needs to know the locale and the character set in order to determine which characters are letters, which are punctuation, which are uppercase or lowercase, and how to convert uppercase to lowercase and vice versa.

What if your program has to handle Cyrillic and Greek? What if this program needs to handle them both at the same time? And what about Asian languages? Chinese does not use a Western-style alphabet but instead uses thousands of distinct ideographs. Several Asian languages have adopted some Chinese ideographs for their own use. The typical implementation of the char type reaches its limit at only 256 distinct characters, which is woefully inadequate for international demands.

In other words, you can’t use plain char and string types if you want to support the majority of the world’s population and their languages. C++ solves this problem with wide characters, which it represents using several types: wchar_t, char16_t, and char32_t. (Unlike C’s definition of wchar_t, the type names in C++ are reserved keywords and built-in types, not typedefs.) The intent is that wchar_t is a native type that can represent characters that don’t fit into a char. With larger characters, a program can support Asian character sets, for example. The char16_t and char32_t are Unicode types. The type char8_t is also for Unicode but is a narrow character type. The Exploration begins by examining wchar_t.

Using Wide Characters

In true C++ fashion, the size and other characteristics of wchar_t are left to the implementation. The only guarantees are that wchar_t is at least as big as char and that wchar_t is the same size as one of the built-in integer types. The <cwchar> header declares a typedef, std::wint_t, for that built-in type. In some implementations, wchar_t may be identical to char, but most desktop and workstation environments use 16 or 32 bits for wchar_t.

Dig up Listing 26-2 and modify it to reveal the size of wchar_t and wint_t in your C++ environment. How many bits are in wchar_t? ________________ How many are in wint_t? ________________ They should be the same number. How many bits are in char? ________________

Wide string objects use the std::wstring type (declared in <string>). A wide string is a string composed of wide characters. In all other ways, wide strings and narrow strings behave similarly; they have the same member functions, and you use them the same way. For example, the size() member function returns the number of characters in the string, regardless of the size of each character.

Wide character and string literals look like their narrow equivalents, except that they start with a capital L and they contain wide characters. The best way to express a wide character in a character or string literal is to specify the character’s hexadecimal value with the \x escape (introduced in Exploration 17). Thus, you have to know the wide character set that your C++ environment uses, and you have to know the numeric value of the desired character in that character set. If your editor and compiler permit it, you may be able to write wide characters directly in a wide character literal, but your source code will not be portable to other environments. You can also write a narrow character in a wide character or string literal, and the compiler automatically converts the narrow characters to wide ones, as shown here:
wchar_t capital_a{'A'};         // the compiler automatically widens narrow characters
std::wstring ray{L"Ray"};
wchar_t pi{L'π'};               // if your tools let you type π as a character
wchar_t pi_unicode{L'\x03c0'};  // if wchar_t uses a Unicode encoding, such as UTF-32
std::wstring price{L"\x20ac" L"12345"};          // Unicode Euro symbol: €12345

Notice how in the last line of the example I divided the string into two parts. Recall from Exploration 17 that the \x escape starts an escape sequence that specifies a character by its value in hexadecimal (base 16). The compiler collects as many characters as it can that form a valid hexadecimal number—that is, digits and the letters A through F (in uppercase or lowercase). It then uses that numeric value as the representation of a single character. If the last line were left as one string, the compiler would try to interpret the entire string as the \x escape. This means the compiler would think the character value is the hexadecimal value 20AC12345₁₆. By separating the strings, the compiler knows when the \x escape ends, and it compiles the character value 20AC₁₆, followed by the characters 1, 2, 3, 4, and 5. Just like narrow strings, the compiler assembles adjacent wide strings into a single wide string. (You are not allowed to place narrow and wide strings next to each other, however. Use all wide strings or all narrow strings, not a mixture of the two.)

Wide Strings

Everything you know about string also applies to wstring. They are just instances of a common template, basic_string. The <string> header declares string to be a typedef for basic_string<char> and wstring as a typedef for basic_string<wchar_t>. The magic of templates takes care of the details.

Because the underlying implementation of string and wstring is actually a template, any time you write utility code that works with strings, you should consider making that code a template too. For example, suppose you want to rewrite the is_palindrome function (from Listing 22-5) so that it operates with wide characters. Instead of merely replacing char with wchar_t, let’s turn it into a function template. Begin by rewriting the supporting functions as function templates, taking the character type as a template argument, so that they work with both narrow and wide strings and characters. My solution is presented in Listing 59-1.
import <locale>;
template<class Char>
auto const& ctype{ std::use_facet<std::ctype<Char>>(std::locale()) };
/** Test for a letter.
 * @param ch the character to test
 * @return true if @p ch is a letter
 */
template<class Char>
bool isletter(Char ch)
{
  return ctype<Char>.is(std::ctype_base::alpha, ch);
}
/** Convert to lowercase.
 * @param ch the character to test
 * @return the character converted to lowercase
 */
template<class Char>
Char lowercase(Char ch)
{
  return ctype<Char>.tolower(ch);
}
/** Compare two characters without regard to case. */
template<class Char>
bool same_char(Char a, Char b)
{
  return lowercase(a) == lowercase(b);
}
Listing 59-1.

Supporting Cast for the is_palindrome Function Template

The next task is to rewrite is_palindrome itself. The basic_string template actually takes three template arguments, and basic_string_view takes two. The first is the character type, and the others are details that needn’t concern us at this time. All that matters is that if you want to templatize your own function that deals with strings, you should handle all of its template parameters.

Before starting, however, you must be aware of a minor hurdle when passing functions as arguments to standard algorithms and views: the argument must be a real function, not the name of a function template. In other words, if you have to work with function templates, such as lowercase and isletter, you must instantiate the template and pass the template instance. When you pass isletter and same_char to the filter view and the equal algorithm, be sure to pass the correct template argument too. If Char is the template parameter for the character type, use isletter<Char> as the predicate argument to std::views::filter.

Rewrite the is_palindrome function as a function template with two template parameters. The first template parameter is the character type: call it Char. Call the second template parameter Traits. You will have to use both arguments to the std::basic_string_view template. Listing 59-2 shows my version of the is_palindrome function, converted to a template, so that it can handle narrow and wide strings.
import <algorithm>;
import <ranges>;
import <string_view>;
/** Determine whether @p str is a palindrome.
 * Only letter characters are tested. Spaces and punctuation don't count.
 * Empty strings are not palindromes because that's just too easy.
 * @param str the string to test
 * @return true if @p str is the same forward and backward
 */
template<class Char, class Traits>
bool is_palindrome(std::basic_string_view<Char, Traits> str)
{
  auto letters_only{ str | std::views::filter(isletter<Char>) };
  auto reversed{ letters_only | std::views::reverse };
  return std::equal(
    std::ranges::begin(letters_only), std::ranges::end(letters_only),
    std::ranges::begin(reversed),     std::ranges::end(reversed),
    same_char<Char>);
}
Listing 59-2.

Changing is_palindrome to a Function Template

The is_palindrome function never uses the Traits template parameter, except to pass it along to basic_string_view. If you’re curious about that parameter, consult a language reference, but be warned that it’s a bit advanced.

Calling is_palindrome is easy, because the compiler uses automatic type deduction to determine whether you are using narrow or wide strings and instantiates the templates accordingly. Thus, the caller doesn’t have to bother with templates at all.

Without further ado, the isletter and lowercase function templates work with wide character arguments. That’s because locales are templates, parameterized on the character type, just like the string and I/O class templates.

However, in order to use wide characters, you do have to perform I/O with wide characters, which is the subject of the next section.

Wide Character I/O

You read wide characters from the standard input by reading from std::wcin. Write wide characters by writing to std::wcout or std::wcerr. Once you read or write anything to or from a stream, the character width of the corresponding narrow and wide streams is fixed, and you cannot change it—you must decide whether to use narrow or wide characters and stay with that choice for the lifetime of the stream. So, a program must use cin or wcin, but not both. Ditto for the output streams. The <iostream> header declares the names of all the standard streams, narrow and wide. The <istream> header defines all the input stream classes and operators; <ostream> defines the output classes and operators. More precisely, <istream> and <ostream> define templates, and the character type is the first template parameter.

The <istream> header defines the std::basic_istream class template, parameterized on the character type. The same header declares two typedefs, as follows:
using istream = basic_istream<char>;
using wistream = basic_istream<wchar_t>;

As you can guess, the <ostream> header is similar, defining the basic_ostream class template and the ostream and wostream typedefs.

The <fstream> header follows the same pattern—basic_ifstream and basic_ofstream are class templates, with typedefs, as in the following:
using ifstream  = basic_ifstream<char>;
using wifstream = basic_ifstream<wchar_t>;
using ofstream  = basic_ofstream<char>;
using wofstream = basic_ofstream<wchar_t>;
Rewrite the main program from Listing 22-5 to test the is_palindrome function template with wide character I/O. Modern desktop environments should be able to support wide characters, but you may have to learn some new features to figure out how to get your text editor to save a file with wide characters. You may also have to load some additional fonts. Most likely, you can supply an ordinary, narrow-text file as input, and the program will work just fine. If you’re having difficulty finding a suitable input file, try the palindrome files that you can download with the other examples in this book. The file names indicate the character set. For example, palindrome-utf8.txt contains UTF-8 input. You have to determine what format your C++ environment expects when reading a wide stream and pick the correct file. My solution is shown in Listing 59-3.
int main()
{
  std::locale::global(std::locale{""});
  std::wcin.imbue(std::locale{});
  std::wcout.imbue(std::locale{});
  std::wstring line{};
  while (std::getline(std::wcin, line))
    if (is_palindrome(std::wstring_view{line}))
      std::wcout << line << L'\n';
}
Listing 59-3.

The main Program for Testing is_palindrome

Reading wide characters from a file or writing wide characters to a file is different from reading or writing narrow characters. All file I/O passes through an additional step of character conversion. C++ always interprets a file as a series of bytes. When reading or writing narrow characters, the conversion of a byte to a narrow character is a no-op, but when reading or writing wide characters, the C++ library has to interpret the bytes to form wide characters. It does so by accumulating one or more adjacent bytes to form each wide character. The rules for deciding which bytes are elements of a wide character and how to combine the characters are specified by the encoding rules for a multi-byte character set.

Multi-byte Character Sets

Multi-byte character sets originated in Asia, where demand for characters exceeded the few character slots available in a single-byte character set, such as ASCII. European nations managed to fit their alphabets into 8-bit character sets, but languages such as Chinese, Japanese, Korean, and Vietnamese require far more bits to represent thousands of ideographs, syllables, and native characters.

The requirements of Asian languages spurred the development of character sets that used two bytes to encode a character—hence the common term double-byte character set (DBCS), with the generalization to multi-byte character sets (MBCS). Many DBCSes were invented, and sometimes a single character had multiple encodings. For example, in Chinese Big 5, the ideograph 丁 has the double-byte value "\xA4\x42". In the EUC-KR character set (which is popular in Korea), the same ideograph has a different encoding: "\xEF\xCB".

The typical DBCS uses characters with the most significant bit set (in an 8-bit byte) to represent double characters. Characters with the most significant bit clear would be taken from a single-byte character set (SBCS). Some DBCSes mandate a particular SBCS; others leave it open, so you get different conventions for different combinations of DBCS and SBCS. Mixing single- and double-byte characters in a single character stream is necessary to represent the common use of character streams that mix Asian and Western text. Working with multi-byte characters is more difficult than working with single-byte characters. A string’s size() function, for example, doesn’t tell you how many characters are in a string. You must examine every byte of the string to learn the number of characters. Indexing into a string is more difficult, because you must take care not to index into the middle of a double-byte character.

Sometimes a single character stream needs more flexibility than simply switching between one particular SBCS and one particular DBCS. Sometimes the stream has to mix multiple double-byte character sets. The ISO 2022 standard is an example of a character set that allows shifting between other, subsidiary character sets. Shift sequences (also called escape sequences, not to be confused with C++ backslash escape sequences) dictate which character set to use. For example, ISO 2022-JP is widely used in Japan and allows switching between ASCII, JIS X 0201 (an SBCS), and JIS X 0208 (a DBCS). Each line of text begins in ASCII, and a shift sequence changes character sets mid-string. For example, the shift sequence "\x1B$B" switches to JIS X 0208-1983.

Seeking to an arbitrary position in a file or text stream that contains shift sequences is clearly problematic. A program that has to seek in a multi-byte text stream must keep track of shift sequences in addition to stream positions. Without knowing the most recent shift sequence in the stream, a program has no way of knowing which character set to use to interpret the subsequent characters.

A number of variations on ISO 2022-JP permit additional character sets. The point here is not to offer a tutorial on Asian character sets but to impress on you the complexities of writing a truly open, general, and flexible mechanism that can support the world’s rich diversity in character sets and locales. These and similar problems gave rise to the Unicode project.

Unicode

Unicode is an attempt to get out of the whole character-set mess by unifying all major variations into one, big, happy character set. To a large degree, the Unicode Consortium has succeeded. The Unicode character set has been adopted as an international standard as ISO 10646. However, the Unicode project includes more than just the character set; it also specifies rules for case-folding, character collation, and more.

Unicode provides 1,114,112 possible character values (called code points). So far, the Unicode Consortium has assigned about 100,000 code points to characters, so there’s plenty of room for expansion. The simplest way to represent a million code points is to use a 32-bit integer, and indeed, this is a common encoding for Unicode. It is not the only encoding, however. The Unicode standard also defines encodings that let you represent a code point using one or two 16-bit integers and one to four 8-bit integers.

The standard way to denote a Unicode code point is U+, followed by the code point as a hexadecimal number of at least four places. Thus, '\x41' is the C++ encoding of U+0041 (Latin capital A) and Greek π has code point U+03C0. A musical eighth note (♪) has code point U+266A or U+1D160; the former code point is in a group of miscellaneous symbols, which happens to include an eighth note. The latter code point is part of a group of musical symbols, which you will need for any significant work with music-related characters.

UTF-32 is the name of the encoding that stores a code point as a 32-bit integer. To represent a UTF-32 code point in C++, preface the character literal with U (uppercase letter U). Such a character literal has type char32_t. For example, to represent the letter A, use U'A'; for a lowercase Greek π, use U'\x03c0'; and for a musical eighth note (♪), use U'\x266a' or U'\x1d160'. Do the same for a character string literal, and the standard library defines the type std::u32string for a string of char32_t. For example, to represent the characters π ≈ 3.14, use the following:
std::u32string pi_approx_3_14{ U"\x03c0 \x2248 3.14" };

Another common encoding for Unicode uses one to four 8-bit units to make up a single code point. Common characters in Western European languages can usually be represented in a single byte, and many other characters take only two bytes. Less common characters require three or four. The result is an encoding that supports the full range of Unicode code points and almost always consumes less memory than other encodings. This character set is called UTF-8. UTF-8 character and string literals are written in the manner of ordinary literals, prefaced with u8. The character type is char8_t, and a UTF-8 string has type std::u8string.

Representing a Greek letter π requires only two bytes, but with different values than the two low-order bytes in UTF-32: u8"\xcf\x80". An eighth note (♪) requires three or four bytes, again with a different encoding than that used in UTF-32: u8"\xe2\x99\xaa" or u8"\xf0\x9d\x85\xa0".

The primary difficulty when dealing with UTF-8 in a program is that the only way to know how many code points are in a string is to scan the entire string. The size() member function returns the number of storage units in the string, but each code point requires one to four storage units. On the other hand, UTF-8 has the advantage that you can seek to an arbitrary position in a UTF-8 byte stream and know whether that position is in the middle of a multi-byte character, because multi-byte characters always have their most significant bit set. By examining the encoding, you can tell whether a byte is the first byte of a multi-byte character or a following byte.

UTF-8 is a common encoding for files and network transmissions. It has become the de facto standard for many desktop environments, word processors (including the one I am using to write this book), web pages, and other everyday applications.

Some other environments use UTF-16, which represents a code point using one or two 16-bit integers. The C++ type for a UTF-16 character literal is char16_t, and the string type is std::u16string. Write such a character literal with the u prefix (lowercase letter u), for example, u'\x03c0'.

Unicode’s designers kept the most common code points in the lower 16-bit region (called the Basic Multilingual Plane, or BMP). When a code point is outside the BMP, that is, its value exceeds U+FFFF, it requires two storage units in UTF-16 and is called a surrogate pair. For example, the musical G clef 𝄞 (U+1D11E) requires two 16-bit storage units: u"\xD834\xDD1E".

Thus, you have the same problem as UTF-8, namely, that one storage unit does not necessarily represent a single code point, so UTF-16 is less than ideal as an in-memory representation. But the vast majority of code points that most programs deal with fit in a single UTF-16 storage unit, so UTF-16 usually requires half the memory of UTF-32, and in many cases, a u16string’s size() is the number of code points in the string (although you can’t be sure without scanning the string).

Some programmers cope with the difficulty of working with UTF-16 by ignoring surrogate pairs completely. They assume that size() does indeed return the number of code points in the string, so their programs work correctly only if all code points are from the BMP. This means you lose access to ancient scripts, specialized alphabets and symbols, and infrequently used ideographs.

UTF-8 has an advantage over UTF-16 and UTF-32 encodings for external representations, because you don’t have to deal with endianness. The Unicode standard defines a mechanism for encoding and revealing the endianness of a stream of UTF-16 or UTF-32 text, but that just makes extra work for you.

Note

The position of the most significant byte is called “endianness.” A “big-endian” platform is one with the most significant byte first. A “little-endian” platform puts the least significant byte first. The popular Intel x86 platform is little-endian.

Universal Character Names

Unicode makes another official appearance in the C++ standard. You can specify a character using its Unicode code point. Use \uXXXX or \UXXXXXXXX, replacing XXXX or XXXXXXXX with the hexadecimal code point. Unlike the \x escape, you must use exactly four hexadecimal digits with \u or eight with \U. These character constructs are called universal character names.

Thus, a better way to encode international characters in a string is to use a universal character name. This helps to insulate you against vagaries in the native character set. On the other hand, you have no control over the compiler’s actions if it cannot map a Unicode code point to a native character. Therefore, if your native character set is ISO 8859-7 (Greek), the following code should initialize the variable pi with the value '\xf0', but if your native character set is ISO 8859-1 (Latin-1), the compiler cannot map it and so might give you a space or a question mark, or the compiler may refuse to compile it:
char pi{'\u03c0'};

Also note that \u and \U are not escape sequences (unlike \x). You can use them anywhere in a program, not only in a character or string literal. Using a Unicode character name lets you use UTF-8 and UTF-16 strings without knowing the encoding details. Thus, a better way to write the UTF-8 string for Greek lowercase π is u8"\u03c0", and the compiler will store the encoded bytes "\xcf\x80".

If you are fortunate, you will be able to avoid universal character names. Instead, your tools will let you edit Unicode characters directly. Instead of dealing with Unicode encoding issues, the editor simply reads and writes universal character names. Thus, the programmer edits WYSIWYG international text, and the source code retains maximum portability. Because universal character names are allowed anywhere, you can use international text in comments too. If you really want to have fun, try using international letters in identifier names. Not all compilers support this feature, although the standard requires it. Thus, you would write a declaration
double π{3.14159265358979};
and your smart editor would store the following in the source file:
double \u03c0{3.14159265358979};

and your standard-compliant compiler would accept it and let you use π as an identifier. I don’t recommend using extended characters in identifiers unless you know that everyone reading your code is using tools that are aware of universal character names. Otherwise, they make the code much harder to read, understand, and maintain.

Does your compiler support universal character names in strings? ________________ Does your compiler support universal character names in identifiers? ________________

Unicode Difficulties

For all the seeming benefits of Unicode, C++ support remains minimal. Although you can write Unicode character literals and string literals, the standard library offers no useful support. Try this exercise: modify the palindrome program to use char32_t instead of wchar_t. What happens?
  • _____________________________________________________________

It doesn’t work. There are no I/O stream classes for Unicode. Template specializations for isalnum and so on don’t exist for char8_t, char16_t, or char32_t. Although the standard library offers some functions for converting Unicode strings to and from wstring, the support ends there.

If you have to work with international characters in any meaningful way, you need a third-party library. The most widely used library is International Components for Unicode (ICU). See the book’s website for a current link.

The next and final topic in Part 3 is to further your understanding of text I/O.
