As you saw in Exploration 18, C++ offers a complicated system to support internationalization and localization of your code. Even if you don’t intend to ship translations of your program in a multitude of languages, you must understand the locale mechanism that C++ uses. Indeed, you have been using it all along, because C++ always sends formatted I/O through the locale system. This Exploration will help you understand locales better and make more effective use of them in your programs.
The Problem
The story of the Tower of Babel resonates with programmers. Imagine a world that speaks a single language and uses a single alphabet. How much simpler programming would be if we didn’t have to deal with character-set issues, language rules, or locales.
The real world has many languages, numerous alphabets and syllabaries, and multitudinous character sets, all making life far richer and more interesting and making a programmer’s job more difficult. Somehow, we programmers must cope. It isn’t easy, and this Exploration cannot give you all the answers, but it’s a start.
Various Ways to Write a Number
Number | Culture |
---|---|
123456.7890 | Default C++ |
123,456.7890 | United States |
123 456.7890 | International scientific |
Rs. 1,23,456.7890 | Indian currency* |
123.456,7890 | Germany |
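Numbers are not the only trouble spot. Other conventions that vary from one culture to another include the following:
12-hour vs. 24-hour clock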
Time zones
Daylight saving time practices
How accented characters are sorted relative to non-accented characters (does 'a' come before or after 'á'?)
Date formats: month/day/year, day/month/year, or year-month-day
Formatting of currency (¥123,456 or 99¢)
Somehow, the poor application programmer must figure out exactly what is culturally dependent, collect the information for all the possible cultures where the application might run, and use that information appropriately in the application. Fortunately, the hard work has already been done for you and is part of the C++ standard library.
Locales to the Rescue
C++ uses a system called locales to manage this disparity of styles. Exploration 18 introduced locales as a means to organize character sets and their properties. Locales also organize formatting of numbers, currency, dates, and times (plus some more stuff that I won’t get into).
The classic locale is named "C". The classic locale specifies the same basic formatting information for all implementations. When a program starts, the classic locale is the initial locale.
An empty string ("") means the default, or native, locale. The default locale obtains formatting and other information from the host operating system in a manner that depends on what the OS can offer. With traditional desktop operating systems, you can assume that the default locale specifies the user’s preferred formatting rules and character-set information. With other environments, such as embedded systems, the default locale may be identical to the classic locale.
Every C++ application has a global locale object. Unless you explicitly change a stream’s locale, it starts off with the global locale. (If you later change the global locale, that does not affect streams that already exist, such as the standard I/O streams.) Initially, the global locale is the classic locale. The classic locale is the same everywhere (except for the parts that depend on the character set), so a program has maximum portability with the classic locale. On the other hand, it has minimum local flavor. The next section explores how you can change a stream’s locale.
Locales and I/O
The standard I/O streams initially use the classic locale. You can imbue a stream with a new locale at any time, but it makes the most sense to do so before performing any I/O.
Typically, you would use the classic locale when reading from, or writing to, files. You usually want the contents of files to be portable and not dependent on a user’s OS preferences. For ephemeral output to a console, you may want to use the default locale, so the user can be most comfortable reading and understanding it. On the other hand, if there is any chance that another program might try to read your program’s output (as happens with UNIX pipes and filters), you should stick with the classic locale, in order to ensure portability and a common format. If you are preparing output to be displayed in a GUI window, by all means, use the default locale.
Facets
The way a stream interprets numeric input and formats numeric output is by making requests of the imbued locale. A locale object is a collection of pieces, each of which manages a small aspect of internationalization. For example, one piece, called numpunct, provides the punctuation symbols for numeric formatting, such as the decimal point character (which is '.' in the United States, but ',' in France). Another piece, num_get, reads from a stream and parses the text to form a number, using information it obtains from numpunct. The pieces such as num_get and numpunct are called facets.
For ordinary numeric I/O, you never have to deal with facets. The I/O streams automatically manage these details for you: the operator<< function uses the num_put facet to format numbers for output, and operator>> uses num_get to interpret text as numeric input. For currency, dates, and times, I/O manipulators use facets to format values. But sometimes you need to use facets yourself. The isalpha, toupper, and other character-related functions about which you learned in Exploration 18 use the ctype facet. Any program that has to do a lot of character testing and converting can benefit by managing its facets directly.
Like strings and I/O streams, facets are class templates, parameterized on the character type. So far, the only character type you have used is char; you will learn about other character types in Exploration 59. The principles are the same, regardless of character type (which is why facets use templates).
Reading from the inside outward, the object named mget is initialized to the result of calling the use_facet function, which is requesting a reference to the money_get<char> facet. The default locale is passed as the sole argument to the use_facet function. The type of mget is a reference to a const money_get<char> facet. It’s a little daunting to read at first, but you’ll get used to it—eventually.
Reading and Writing Currency Using the Money I/O Manipulators
The locale manipulators work like other manipulators, but they invoke the associated facets. The manipulators use the stream to take care of the error flags, iterators, fill character, and so on. The get_time and put_time manipulators read and write dates and times; consult a library reference for details.
Character Categories
Character Classification Functions
Function | Description | Classic Locale |
---|---|---|
isalnum | Alphanumeric | 'a'–'z', 'A'–'Z', '0'–'9' |
isalpha | Alphabetic | 'a'–'z', 'A'–'Z' |
iscntrl | Control | Any non-printable character* |
isdigit | Digit | '0'–'9' (in all locales) |
isgraph | Graphical | Printable character other than ' '* |
islower | Lowercase | 'a'–'z' |
isprint | Printable | Any printable character in the character set* |
ispunct | Punctuation | Printable character other than alphanumeric or white space* |
isspace | White space | ' ', '\f', '\n', '\r', '\t', '\v' |
isupper | Uppercase | 'A'–'Z' |
isxdigit | Hexadecimal digit | 'a'–'f', 'A'–'F', '0'–'9' (in all locales) |
The classic locale has fixed definitions for some categories (such as isupper). Other locales, however, can expand these definitions to include other characters, which may (and probably will) depend on the character set too. Only isdigit and isxdigit have fixed definitions for all locales and all character sets.
However, even in the classic locale, the precise implementation of some functions, such as isprint, depends on the character set. For example, in the popular ISO 8859-1 (Latin-1) character set, '\x80' is a control character, but in the equally popular Windows-1252 character set, it is printable. In UTF-8, '\x80' is invalid, so all the categorization functions would return false.
The interaction between the locale and the character set is one of the areas where C++ underperforms. The locale can change at any time, which potentially sets a new character set, which in turn can give new meaning to certain character values. But the compiler’s view of the character set is fixed. For instance, the compiler treats 'A' as the uppercase Roman letter A and compiles the numeric code according to its idea of the runtime character set. That numeric value is then fixed forever. If the categorization functions use the same character set, everything is fine. The isalpha and isupper functions return true; isdigit returns false; and all is right with the world. If the user changes the locale and by so doing changes the character set, those functions may no longer work correctly with that character variable.
Exploring Character Sets and Locales
As you can see, the same character has different categories, depending on the locale’s character set. Now imagine that the user has entered a string, and your program has stored the string. If your program changes the global locale or the locale used to process that string, you may end up misinterpreting the string.
In Listing 58-2, the categorization functions reload their facets every time they are called, but you can rewrite the program so it loads its facet only once. The character type facet is called ctype. It has a function named is that takes a category mask and a character as arguments and returns a bool: true if the character has a type in the mask. The mask values are specified in std::ctype_base.
Notice the convention that the standard library uses throughout. When a class template needs helper types and constants, they are declared in a non-template base class. The class template derives from the base class and so gains easy access to the types and constants. Callers gain access to the types and constants by qualifying with the base class name. By avoiding the template in the base class, the standard library avoids unnecessary instantiations just to use a type or constant that is unrelated to the template argument.
Caching the ctype Facet
Counting Words Again, This Time with Cached Facets
Notice how most of the program is unchanged. The simple act of caching the ctype facet reduces this program’s runtime by about 15% on my system.
Collation Order
You can use the relational operators (such as <) with characters and strings, but they don’t actually compare characters or code points; they compare storage units. Most users don’t care whether a list of names is sorted in ascending numerical order by storage unit. They want a list of names sorted in ascending alphabetical order, according to their native collation rules.
Demonstrating How Collation Order Depends on Locale
The \uNNNN characters are a portable way to express Unicode characters. The NNNN must be four hexadecimal digits, specifying a Unicode code point. You will learn more in the next Exploration.
Collation Order for Various Locales
Classic | Great Britain | Norway |
---|---|---|
aether | aether | aether |
angle | æther | angle |
circus | angle | çircê |
essen | ångstrom | circus |
ether | çircê | essen |
eßen | circus | eßen |
ångstrom | essen | ether |
æther | eßen | æther |
çircê | ether | ångstrom |
The next Exploration takes a closer look at Unicode, international character sets, and related challenges.