Explorations 17–19 discussed characters, but only hinted at bigger things to come. Exploration 58 started to examine these bigger issues with locales and facets. The next topic to explore is how to wrangle character sets and character encodings in an international setting.
This Exploration introduces wide characters, which are like ordinary (or narrow) characters, except that they usually occupy more memory. This means the wide character type can potentially represent many more characters than plain char. During your exploration of wide characters, you will also get to know more about Unicode.
Why Wide?
As you saw in Exploration 18, the meaning of a particular character value depends on the locale and character set. For instance, in one locale, you can handle Greek characters, while in another locale, Cyrillic, depending on the character set. Your program needs to know the locale and the character set in order to determine which characters are letters, which are punctuation, which are uppercase or lowercase, and how to convert uppercase to lowercase and vice versa.
What if your program has to handle Cyrillic and Greek? What if this program needs to handle them both at the same time? And what about Asian languages? Chinese does not use a Western-style alphabet but instead uses thousands of distinct ideographs. Several Asian languages have adopted some Chinese ideographs for their own use. The typical implementation of the char type reaches its limit at only 256 distinct characters, which is woefully inadequate for international demands.
In other words, you can’t use plain char and string types if you want to support the majority of the world’s population and their languages. C++ solves this problem with wide characters, which it represents using several types: wchar_t, char16_t, and char32_t. (Unlike C’s definition of wchar_t, the type names in C++ are reserved keywords and built-in types, not typedefs.) The intent is that wchar_t is a native type that can represent characters that don’t fit into a char. With larger characters, a program can support Asian character sets, for example. The char16_t and char32_t are Unicode types. The type char8_t is also for Unicode but is a narrow character type. The Exploration begins by examining wchar_t.
Using Wide Characters
In true C++ fashion, the size and other characteristics of wchar_t are left to the implementation. The only guarantees are that wchar_t is at least as big as char and that wchar_t is the same size as one of the built-in integer types. The <cwchar> header declares a typedef, std::wint_t, for that built-in type. In some implementations, wchar_t may be identical to char, but most desktop and workstation environments use 16 or 32 bits for wchar_t.
Dig up Listing 26-2 and modify it to reveal the size of wchar_t and wint_t in your C++ environment. How many bits are in wchar_t? ________________ How many are in wint_t? ________________ They should be the same number. How many bits are in char? ________________
Wide string objects use the std::wstring type (declared in <string>). A wide string is a string composed of wide characters. In all other ways, wide strings and narrow strings behave similarly; they have the same member functions, and you use them the same way. For example, the size() member function returns the number of characters in the string, regardless of the size of each character.
Notice how in the last line of the example I divided the string into two parts. Recall from Exploration 17 that the \x escape starts an escape sequence that specifies a character by its value in hexadecimal (base 16). The compiler collects as many characters as it can that form a valid hexadecimal number—that is, digits and the letters A through F (in uppercase or lowercase). It then uses that numeric value as the representation of a single character. If the last line were left as one string, the compiler would try to interpret the entire string as the \x escape. This means the compiler would think the character value is the hexadecimal value 20AC12345₁₆. By separating the strings, the compiler knows when the \x escape ends, and it compiles the character value 20AC₁₆, followed by the characters 1, 2, 3, 4, and 5. Just like narrow strings, the compiler assembles adjacent wide strings into a single wide string. (You are not allowed to place narrow and wide strings next to each other, however. Use all wide strings or all narrow strings, not a mixture of the two.)
Wide Strings
Everything you know about string also applies to wstring. They are just instances of a common template, basic_string. The <string> header declares string to be a typedef for basic_string<char> and wstring as a typedef for basic_string<wchar_t>. The magic of templates takes care of the details.
Supporting Cast for the is_palindrome Function Template
The next task is to rewrite is_palindrome itself. The basic_string template actually takes three template arguments, and basic_string_view takes two. In both cases, the first is the character type, and the remaining parameters are details that needn’t concern us at this time. All that matters is that if you want to templatize your own function that deals with strings, you should handle all of the template parameters.
Before starting, however, you must be aware of a minor hurdle when dealing with functions as arguments to standard algorithms: the argument must be a real function, not the name of a function template. In other words, if you have to work with function templates, such as lowercase and non_letter, you must instantiate the template and pass the template instance. When you pass non_letter and same_char to the remove_if and equal algorithms, be sure to pass the correct template argument too. If Char is the template parameter for the character type, use non_letter<Char> as the functor argument to remove_if.
Changing is_palindrome to a Function Template
The is_palindrome function never uses the Traits template parameter, except to pass it along to basic_string_view. If you’re curious about that parameter, consult a language reference, but be warned that it’s a bit advanced.
Calling is_palindrome is easy, because the compiler uses automatic type deduction to determine whether you are using narrow or wide strings and instantiates the templates accordingly. Thus, the caller doesn’t have to bother with templates at all.
Without further ado, the isletter and lowercase function templates work with wide character arguments. That’s because locales are templates, parameterized on the character type, just like the string and I/O class templates.
However, in order to use wide characters, you do have to perform I/O with wide characters, which is the subject of the next section.
Wide Character I/O
You read wide characters from the standard input by reading from std::wcin. Write wide characters by writing to std::wcout or std::wcerr. Once you read or write anything to or from a stream, the character width of the corresponding narrow and wide streams is fixed, and you cannot change it—you must decide whether to use narrow or wide characters and stay with that choice for the lifetime of the stream. So, a program must use cin or wcin, but not both. Ditto for the output streams. The <iostream> header declares the names of all the standard streams, narrow and wide. The <istream> header defines all the input stream classes and operators; <ostream> defines the output classes and operators. More precisely, <istream> and <ostream> define templates, and the character type is the first template parameter.
As you can guess, the <ostream> header is similar, defining the basic_ostream class template and the ostream and wostream typedefs.
The main Program for Testing is_palindrome
Reading wide characters from a file or writing wide characters to a file is different from reading or writing narrow characters. All file I/O passes through an additional step of character conversion. C++ always interprets a file as a series of bytes. When reading or writing narrow characters, the conversion of a byte to a narrow character is a no-op, but when reading or writing wide characters, the C++ library has to interpret the bytes to form wide characters. It does so by accumulating one or more adjacent bytes to form each wide character. The rules for deciding which bytes are elements of a wide character and how to combine the characters are specified by the encoding rules for a multi-byte character set.
Multi-byte Character Sets
Multi-byte character sets originated in Asia, where demand for characters exceeded the few character slots available in a single-byte character set, such as ASCII. European nations managed to fit their alphabets into 8-bit character sets, but languages such as Chinese, Japanese, Korean, and Vietnamese require far more bits to represent thousands of ideographs, syllables, and native characters.
The requirements of Asian languages spurred the development of character sets that used two bytes to encode a character—hence the common term double-byte character set (DBCS), with the generalization to multi-byte character sets (MBCS). Many DBCSes were invented, and sometimes a single character had multiple encodings. For example, in Chinese Big 5, the ideograph 丁 has the double-byte value "\xA4\x42". In the EUC-KR character set (which is popular in Korea), the same ideograph has a different encoding: "\xEF\xCB".
The typical DBCS uses characters with the most significant bit set (in an 8-bit byte) to signal double-byte characters. Characters with the most significant bit clear are taken from a single-byte character set (SBCS). Some DBCSes mandate a particular SBCS; others leave it open, so you get different conventions for different combinations of DBCS and SBCS. Mixing single- and double-byte characters in a single character stream is necessary because real-world character streams commonly mix Asian and Western text. Working with multi-byte characters is more difficult than working with single-byte characters. A string’s size() function, for example, doesn’t tell you how many characters are in the string; you must examine every byte of the string to learn the number of characters. Indexing into a string is also harder, because you must take care not to index into the middle of a double-byte character.
Sometimes a single character stream needs more flexibility than simply switching between one particular SBCS and one particular DBCS. Sometimes the stream has to mix multiple double-byte character sets. The ISO 2022 standard is an example of a character set that allows shifting between other, subsidiary character sets. Shift sequences (also called escape sequences, not to be confused with C++ backslash escape sequences) dictate which character set to use. For example, ISO 2022-JP is widely used in Japan and allows switching among ASCII, JIS X 0201 (an SBCS), and JIS X 0208 (a DBCS). Each line of text begins in ASCII, and a shift sequence changes character sets mid-string. For example, the shift sequence "\x1B$B" switches to JIS X 0208-1983.
Seeking to an arbitrary position in a file or text stream that contains shift sequences is clearly problematic. A program that has to seek in a multi-byte text stream must keep track of shift sequences in addition to stream positions. Without knowing the most recent shift sequence in the stream, a program has no way of knowing which character set to use to interpret the subsequent characters.
A number of variations on ISO 2022-JP permit additional character sets. The point here is not to offer a tutorial on Asian character sets but to impress on you the complexities of writing a truly open, general, and flexible mechanism that can support the world’s rich diversity in character sets and locales. These and similar problems gave rise to the Unicode project.
Unicode
Unicode is an attempt to get out of the whole character-set mess by unifying all major variations into one, big, happy character set. To a large degree, the Unicode Consortium has succeeded. The Unicode character set has been adopted as an international standard as ISO 10646. However, the Unicode project includes more than just the character set; it also specifies rules for case-folding, character collation, and more.
Unicode provides 1,114,112 possible character values (called code points). So far, the Unicode Consortium has assigned about 100,000 code points to characters, so there’s plenty of room for expansion. The simplest way to represent a million code points is to use a 32-bit integer, and indeed, this is a common encoding for Unicode. It is not the only encoding, however. The Unicode standard also defines encodings that let you represent a code point using one or two 16-bit integers and one to four 8-bit integers.
The standard way to denote a Unicode code point is U+, followed by the code point as a hexadecimal number of at least four places. Thus, '\x41' is the C++ encoding of U+0041 (Latin capital A), and Greek π has code point U+03C0. A musical eighth note (♪) has code point U+266A or U+1D160; the former lies in a group of miscellaneous symbols, which happens to include an eighth note, whereas the latter is part of a group of musical symbols, which you will need for any significant work with music-related characters.
Another common encoding for Unicode uses one to four 8-bit units to make up a single code point. Common characters in Western European languages can usually be represented in a single byte, and many other characters take only two bytes. Less common characters require three or four. The result is an encoding that supports the full range of Unicode code points and almost always consumes less memory than other encodings. This character set is called UTF-8. UTF-8 literals are written in the manner of ordinary character and string literals, prefaced with u8. The character type for UTF-8 is char8_t, a narrow type, and a UTF-8 string has type std::u8string.
Representing the Greek letter π requires only two bytes, but with different values than the two low-order bytes in UTF-32: u8"\xcf\x80". An eighth note (♪) requires three or four bytes, again with a different encoding than that used in UTF-32: u8"\xe2\x99\xaa" or u8"\xf0\x9d\x85\xa0".
The primary difficulty when dealing with UTF-8 in a program is that the only way to know how many code points are in a string is to scan the entire string. The size() member function returns the number of storage units in the string, but each code point requires one to four storage units. On the other hand, UTF-8 has the advantage that you can seek to an arbitrary position in a UTF-8 byte stream and know whether that position is in the middle of a multi-byte character, because the units of a multi-byte character always have their most significant bit set. By examining the encoding, you can tell whether a byte is the first byte of a multi-byte character or a following byte.
UTF-8 is a common encoding for files and network transmissions. It has become the de facto standard for many desktop environments, word processors (including the one I am using to write this book), web pages, and other everyday applications.
Some other environments use UTF-16, which represents a code point using one or two 16-bit integers. The C++ type for a UTF-16 character literal is char16_t, and the string type is std::u16string. Write such a character literal with the u prefix (lowercase letter u), for example, u'\x03c0'.
Unicode’s designers kept the most common code points in the lower 16-bit region (called the Basic Multilingual Plane, or BMP). When a code point is outside the BMP, that is, its value exceeds U+FFFF, it requires two storage units in UTF-16, and the pair of units is called a surrogate pair. For example, the musical G clef symbol (𝄞, U+1D11E) requires two 16-bit storage units: u"\xD834\xDD1E".
Thus, you have the same problem as with UTF-8, namely, that one storage unit does not necessarily represent a single code point, so UTF-16 is less than ideal as an in-memory representation. But the vast majority of code points that most programs deal with fit in a single UTF-16 storage unit, so UTF-16 usually requires half the memory of UTF-32, and in many cases, a u16string’s size() is the number of code points in the string (although you can’t be sure without scanning the string).
Some programmers cope with the difficulty of working with UTF-16 by ignoring surrogate pairs completely. They assume that size() does indeed return the number of code points in the string, so their programs work correctly only if all code points are from the BMP. This means you lose access to ancient scripts, specialized alphabets and symbols, and infrequently used ideographs.
UTF-8 has an advantage over UTF-16 and UTF-32 encodings for external representations, because you don’t have to deal with endianness. The Unicode standard defines a mechanism for encoding and revealing the endianness of a stream of UTF-16 or UTF-32 text, but that just makes extra work for you.
The position of the most significant byte is called “endianness.” A “big-endian” platform is one with the most significant byte first. A “little-endian” platform puts the least significant byte first. The popular Intel x86 platform is little-endian.
Universal Character Names
Unicode makes another official appearance in the C++ standard. You can specify a character using its Unicode code point. Use \uXXXX or \UXXXXXXXX, replacing XXXX or XXXXXXXX with the hexadecimal code point. Unlike the \x escape, you must use exactly four hexadecimal digits with \u or eight with \U. These character constructs are called universal character names.
Also note that \u and \U are not escape sequences (unlike \x). You can use them anywhere in a program, not only in a character or string literal. Using a universal character name lets you use UTF-8 and UTF-16 strings without knowing the encoding details. Thus, a better way to write the UTF-8 string for Greek lowercase π is u8"\u03c0", and the compiler will store the encoded bytes "\xcf\x80".
A standard-compliant compiler even accepts universal character names in identifiers, letting you use π as a variable name. I don’t recommend using extended characters in identifiers unless you know that everyone reading your code is using tools that are aware of universal character names. Otherwise, they make the code much harder to read, understand, and maintain.
Does your compiler support universal character names in strings? ________________ Does your compiler support universal character names in identifiers? ________________
Unicode Difficulties
_____________________________________________________________
It doesn’t work. There are no I/O stream classes for Unicode. Template specializations for isalnum and so on don’t exist for char8_t, char16_t, or char32_t. Although the standard library offers some functions for converting Unicode strings to and from wstring, the support ends there.
If you have to work with international characters in any meaningful way, you need a third-party library. The most widely used library is International Components for Unicode (ICU). See the book’s website for a current link.
The next and final topic in Part 3 is to further your understanding of text I/O.