© Ray Lischner 2020
R. Lischner, Exploring C++20, https://doi.org/10.1007/978-1-4842-5961-0_58

58. Locales and Facets

Ray Lischner (Ellicott City, MD, USA)

As you saw in Exploration 18, C++ offers a complicated system to support internationalization and localization of your code. Even if you don’t intend to ship translations of your program in a multitude of languages, you must understand the locale mechanism that C++ uses. Indeed, you have been using it all along, because C++ always sends formatted I/O through the locale system. This Exploration will help you understand locales better and make more effective use of them in your programs.

The Problem

The story of the Tower of Babel resonates with programmers. Imagine a world that speaks a single language and uses a single alphabet. How much simpler programming would be if we didn’t have to deal with character-set issues, language rules, or locales.

The real world has many languages, numerous alphabets and syllabaries, and multitudinous character sets, all making life far richer and more interesting and making a programmer’s job more difficult. Somehow, we programmers must cope. It isn’t easy, and this Exploration cannot give you all the answers, but it’s a start.

Different cultures, languages, and character sets give rise to different methods to present and interpret information, different interpretations of character codes (as you learned in Exploration 18), and different ways of organizing (especially sorting) information. Even with numeric data, you may find that you have to write the same number in several ways, depending on the local environment, culture, and language. Table 58-1 presents just a few examples of the ways to write a number according to various cultures, conventions, and locales.
Table 58-1. Various Ways to Write a Number

Number               Culture
123456.7890          Default C++
123,456.7890         United States
123 456.7890         International scientific
Rs. 1,23,456.7890    Indian currency*
123.456,7890         Germany

*Yes, the commas are correct.

Other cultural differences can include
  • 12-hour vs. 24-hour clock

  • Time zones

  • Daylight saving time practices

  • How accented characters are sorted relative to non-accented characters (does 'a' come before or after 'á'?)

  • Date formats: month/day/year, day/month/year, or year-month-day

  • Formatting of currency (¥123,456 or 99¢)

Somehow, the poor application programmer must figure out exactly what is culturally dependent, collect the information for all the possible cultures where the application might run, and use that information appropriately in the application. Fortunately, the hard work has already been done for you and is part of the C++ standard library.

Locales to the Rescue

C++ uses a system called locales to manage this disparity of styles. Exploration 18 introduced locales as a means to organize character sets and their properties. Locales also organize formatting of numbers, currency, dates, and times (plus some more stuff that I won’t get into).

C++ defines a basic locale, known as the classic locale, which provides minimal formatting. Each C++ implementation is then free to provide additional locales. Each locale typically has a name, but the C++ standard does not mandate any particular naming convention, which makes it difficult to write portable code. You can rely on only two standard names:
  • The classic locale is named "C". The classic locale specifies the same basic formatting information for all implementations. When a program starts, the classic locale is the initial locale.

  • An empty string ("") means the default, or native, locale. The default locale obtains formatting and other information from the host operating system in a manner that depends on what the OS can offer. With traditional desktop operating systems, you can assume that the default locale specifies the user’s preferred formatting rules and character-set information. With other environments, such as embedded systems, the default locale may be identical to the classic locale.

A number of C++ implementations use ISO and POSIX standards for naming locales: an ISO 639 code for the language (e.g., fr for French, en for English, ko for Korean), optionally followed by an underscore and an ISO 3166 code for the region (e.g., CH for Switzerland, GB for Great Britain, HK for Hong Kong). The name is optionally followed by a dot and the name of the character set (e.g., utf8 for Unicode UTF-8, Big5 for Chinese Big 5 encoding). Thus, I use en_US.utf8 for my default locale. A native of Taiwan might use zh_TW.Big5; developers in French-speaking Switzerland might use fr_CH.latin1. Read your library documentation to learn how it specifies locale names. What is your default locale? ________________ What are its main characteristics?
  • _____________________________________________________________

  • _____________________________________________________________

  • _____________________________________________________________
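Because locale names are not portable, a program that asks for a named locale should be ready for the request to fail. The std::locale constructor throws std::runtime_error when it does not recognize a name. The following is a minimal sketch of one way to cope; the function name locale_or_classic is my own invention for illustration, not part of the standard library.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: try to construct the named locale, and fall back
// to the classic locale if the implementation does not recognize the name.
std::locale locale_or_classic(std::string const& name)
{
  try {
    return std::locale{name};
  }
  catch (std::runtime_error const&) {
    return std::locale::classic();
  }
}
```

Falling back to the classic locale is only one possible policy. Depending on the application, you might prefer to try the native locale ("") first or to report the error to the user.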

Every C++ application has a global locale object. Unless you explicitly change a stream’s locale, it starts off with the global locale. (If you later change the global locale, that does not affect streams that already exist, such as the standard I/O streams.) Initially, the global locale is the classic locale. The classic locale is the same everywhere (except for the parts that depend on the character set), so a program has maximum portability with the classic locale. On the other hand, it has minimum local flavor. The next section explores how you can change a stream’s locale.
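To see that an existing stream keeps its locale when the global locale changes, you can compare the locales that two streams report. The sketch below assumes only standard library calls; the names locale_check and global_locale_demo are hypothetical. It builds an unnamed replacement locale by adding a fresh numpunct facet to the classic locale, so the test does not depend on any locale names that your system may or may not offer.

```cpp
#include <locale>
#include <sstream>

// Hypothetical result type for the demonstration.
struct locale_check
{
  bool existing_unchanged;        // old stream still has its original locale
  bool created_later_uses_global; // new stream picked up the new global locale
};

locale_check global_locale_demo()
{
  std::ostringstream existing;    // copies the current global locale
  std::locale const original{ existing.getloc() };

  // Classic locale plus a new facet object: a distinct, unnamed locale.
  std::locale const replacement{ std::locale::classic(), new std::numpunct<char> };
  std::locale const previous{ std::locale::global(replacement) };

  std::ostringstream created_later; // copies the new global locale

  locale_check result{
    existing.getloc() == original and existing.getloc() != replacement,
    created_later.getloc() == replacement
  };
  std::locale::global(previous);  // restore the earlier global locale
  return result;
}
```

Both members of the result should be true: the first stream never notices the change, and the second stream starts life with the replacement locale.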

Locales and I/O

Recall from Exploration 18 that you imbue a stream with a locale in order to format I/O according to the locale’s rules. Thus, to ensure that you read input in the classic locale and that you print results in the user’s native locale, you need the following:
std::cin.imbue(std::locale::classic()); // standard input uses the classic locale
std::cout.imbue(std::locale{""});       // imbue with the user's default locale

The standard I/O streams initially use the classic locale. You can imbue a stream with a new locale at any time, but it makes the most sense to do so before performing any I/O.

Typically, you would use the classic locale when reading from, or writing to, files. You usually want the contents of files to be portable and not dependent on a user’s OS preferences. For ephemeral output to a console or GUI window, you may want to use the default locale, so the user can be most comfortable reading and understanding it. On the other hand, if there is any chance that another program might try to read your program’s output (as happens with UNIX pipes and filters), you should stick with the classic locale, in order to ensure portability and a common format. If you are preparing output to be displayed in a GUI, by all means, use the default locale.

Facets

The way a stream interprets numeric input and formats numeric output is by making requests of the imbued locale. A locale object is a collection of pieces, each of which manages a small aspect of internationalization. For example, one piece, called numpunct, provides the punctuation symbols for numeric formatting, such as the decimal point character (which is '.' in the United States, but ',' in France). Another piece, num_get, reads from a stream and parses the text to form a number, using information it obtains from numpunct. The pieces such as num_get and numpunct are called facets.
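One way to watch numpunct at work without depending on any particular named locale is to derive your own facet and install it in a copy of the classic locale. The following sketch makes that assumption; custom_punct, format_number, german_style, and indian_style are hypothetical names for illustration. The grouping string holds small integers, not digits: "\3" means groups of three, and "\3\2" means one group of three followed by groups of two.

```cpp
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical facet: the caller chooses the punctuation and grouping.
class custom_punct : public std::numpunct<char>
{
public:
  custom_punct(char decimal, char separator, std::string grouping)
  : decimal_{decimal}, separator_{separator}, grouping_{std::move(grouping)}
  {}
protected:
  char do_decimal_point() const override { return decimal_; }
  char do_thousands_sep() const override { return separator_; }
  std::string do_grouping() const override { return grouping_; }
private:
  char decimal_;
  char separator_;
  std::string grouping_;
};

// The locale takes ownership of the facet and deletes it when done.
std::locale german_style()
{
  return std::locale{std::locale::classic(), new custom_punct{',', '.', "\3"}};
}

std::locale indian_style()
{
  return std::locale{std::locale::classic(), new custom_punct{'.', ',', "\3\2"}};
}

// Format a value with four decimal places according to loc.
std::string format_number(double value, std::locale const& loc)
{
  std::ostringstream out;
  out.imbue(loc);
  out << std::fixed << std::setprecision(4) << value;
  return out.str();
}
```

With these definitions, format_number(123456.789, german_style()) yields 123.456,7890, and the same value with indian_style() yields 1,23,456.7890, matching the numeric forms in Table 58-1. The stream code never changes; only the facet does.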

For ordinary numeric I/O, you never have to deal with facets. The I/O streams automatically manage these details for you: the operator<< function uses the num_put facet to format numbers for output, and operator>> uses num_get to interpret text as numeric input. For currency, dates, and times, I/O manipulators use facets to format values. But sometimes you need to use facets yourself. The isalpha, toupper, and other character-related functions about which you learned in Exploration 18 use the ctype facet. Any program that has to do a lot of character testing and converting can benefit by managing its facets directly.

Like strings and I/O streams, facets are class templates, parameterized on the character type. So far, the only character type you have used is char; you will learn about other character types in Exploration 59. The principles are the same, regardless of character type (which is why facets use templates).

To obtain a facet from a locale, call the use_facet function template. The template argument is the facet you seek, and the function argument is the locale object. The returned facet is const and is not copyable, so the best way to use the result is to initialize a const reference, as demonstrated here:
auto const& mget{ std::use_facet<std::money_get<char>>(std::locale{""}) };

Reading from the inside outward, the object named mget is initialized to the result of calling the use_facet function, which is requesting a reference to the money_get<char> facet. The default locale is passed as the sole argument to the use_facet function. The type of mget is a reference to a const money_get<char> facet. It’s a little daunting to read at first, but you’ll get used to it—eventually.
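The same pattern works for any facet. As a small sketch, the following function asks a locale for its numpunct facet and reports a few of its settings. The struct punctuation and the function describe are hypothetical names, not standard ones.

```cpp
#include <locale>

// Hypothetical summary of a locale's numeric punctuation.
struct punctuation
{
  char decimal_point;
  char thousands_sep;
  bool groups_digits;  // true if the locale groups digits at all
};

punctuation describe(std::locale const& loc)
{
  // Bind the facet to a const reference; facets cannot be copied.
  auto const& punct{ std::use_facet<std::numpunct<char>>(loc) };
  return { punct.decimal_point(), punct.thousands_sep(),
           not punct.grouping().empty() };
}
```

In the classic locale, the decimal point is '.' and no digit grouping takes place. One caution: the standard guarantees that the reference returned by use_facet stays valid only as long as some copy of the locale exists, so keep a locale object alive for as long as you use the facet.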

Using facets directly can be complicated. Fortunately, the standard library offers a few I/O manipulators (declared in <iomanip>) to simplify the use of the time and currency facets. Listing 58-1 shows a simple program that imbues the standard I/O streams with the native locale and then reads and writes currency values.
import <iomanip>;
import <iostream>;
import <locale>;
import <string>;
int main()
{
  std::locale native{""};
  std::cin.imbue(native);
  std::cout.imbue(native);
  std::cin >> std::noshowbase;  // currency symbol is optional for input
  std::cout << std::showbase;   // always write the currency symbol for output
  std::string digits;
  while (std::cin >> std::get_money(digits))
  {
    std::cout << std::put_money(digits) << '\n';
  }
  if (not std::cin.eof())
    std::cout << "Invalid input.\n";
}
Listing 58-1.

Reading and Writing Currency Using the Money I/O Manipulators

The locale manipulators work like other manipulators, but they invoke the associated facets. The manipulators use the stream to take care of the error flags, iterators, fill character, and so on. The get_time and put_time manipulators read and write dates and times; consult a library reference for details.
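As a quick illustration of the time manipulators, here is a sketch that writes and reads a date in year-month-day form. It imbues the classic locale so the result does not depend on your environment; the function names format_date and parse_date are hypothetical. The format strings use the same conversion specifiers as the C function strftime.

```cpp
#include <ctime>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Format a calendar date as year-month-day in the classic locale.
std::string format_date(std::tm const& date)
{
  std::ostringstream out;
  out.imbue(std::locale::classic());
  out << std::put_time(&date, "%Y-%m-%d");
  return out.str();
}

// Parse a year-month-day date into date.
// Return false if the text does not match the format.
bool parse_date(std::string const& text, std::tm& date)
{
  std::istringstream in{text};
  in.imbue(std::locale::classic());
  in >> std::get_time(&date, "%Y-%m-%d");
  return not in.fail();
}
```

Remember that std::tm stores the year as an offset from 1900 and the month as a zero-based index, so parsing 2020-02-29 stores 120 in tm_year and 1 in tm_mon.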

Character Categories

This section continues the examination of character sets and locales that you began in Exploration 18. In addition to testing for alphanumeric characters or lowercase characters, you can test for several different categories. Table 58-2 lists all the classification functions and their behavior in the classic locale. They all take a character as the first argument and a locale as the second; they all return a bool result.
Table 58-2. Character Classification Functions

Function   Description         Classic Locale
isalnum    Alphanumeric        'a'–'z', 'A'–'Z', '0'–'9'
isalpha    Alphabetic          'a'–'z', 'A'–'Z'
iscntrl    Control             Any non-printable character*
isdigit    Digit               '0'–'9' (in all locales)
isgraph    Graphical           Printable character other than ' '*
islower    Lowercase           'a'–'z'
isprint    Printable           Any printable character in the character set*
ispunct    Punctuation         Printable character other than alphanumeric or white space*
isspace    White space         ' ', '\f', '\n', '\r', '\t', '\v'
isupper    Uppercase           'A'–'Z'
isxdigit   Hexadecimal digit   'a'–'f', 'A'–'F', '0'–'9' (in all locales)

*Behavior depends on the character set, even in the classic locale.

The classic locale has fixed definitions for some categories (such as isupper). Other locales, however, can expand these definitions to include other characters, which may (and probably will) depend on the character set too. Only isdigit and isxdigit have fixed definitions for all locales and all character sets.

However, even in the classic locale, the precise implementation of some functions, such as isprint, depends on the character set. For example, in the popular ISO 8859-1 (Latin-1) character set, '\x80' is a control character, but in the equally popular Windows-1252 character set, it is printable. In UTF-8, '\x80' by itself is invalid, so all the categorization functions would return false.
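For ordinary ASCII characters, the classic-locale column of Table 58-2 is predictable, and a short sketch can confirm it. The function name category is hypothetical; it calls the two-argument classification functions from <locale> and reports the first category that matches.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: name the first matching category for a character,
// using the two-argument classification functions from <locale>.
std::string category(char c, std::locale const& loc)
{
  if (std::isdigit(c, loc)) return "digit";
  if (std::isupper(c, loc)) return "uppercase";
  if (std::islower(c, loc)) return "lowercase";
  if (std::isspace(c, loc)) return "white space";
  if (std::ispunct(c, loc)) return "punctuation";
  if (std::iscntrl(c, loc)) return "control";
  return "other";
}
```

The order of the tests matters because categories overlap. A tab, for example, is both white space and a control character, so this sketch reports it as white space.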

The interaction between the locale and the character set is one of the areas where C++ underperforms. The locale can change at any time, which potentially sets a new character set, which in turn can give new meaning to certain character values. But, the compiler’s view of the character set is fixed. For instance, the compiler treats 'A' as the uppercase Roman letter A and compiles the numeric code according to its idea of the runtime character set. That numeric value is then fixed forever. If the characterization functions use the same character set, everything is fine. The isalpha and isupper functions return true; isdigit returns false; and all is right with the world. If the user changes the locale and by so doing changes the character set, those functions may not work with that character variable any more.

Let’s consider a concrete example as shown in Listing 58-2. This program encodes locale names, which may not work for your environment. Read the comments and see if your environment can support the same kind of locales, albeit with different names. You will need the ioflags class from Listing 40-4. Copy the class to its own module called ioflags or download the file from the book’s website. After reading Listing 58-2, what do you expect as the result?
  • _____________________________________________________________

  • _____________________________________________________________

import <format>;
import <iostream>;
import <locale>;
import <ostream>;
import <string>;
import ioflags;  // from Listing 40-4
/// Print a character's categorization in a locale.
void print(int c, std::string const& name, std::locale loc)
{
  // Don't concern yourself with the & operator. I'll cover that later
  // in the book, in Exploration 63. Its purpose is just to ensure
  // the character's escape code is printed correctly.
  std::cout << std::format("\\x{:02x} is {} in {}\n", c & 0xff, name, loc.name());
}
/// Test a character's categorization in the locale, @p loc.
void test(char c, std::locale loc)
{
  ioflags save{std::cout};
  if (std::isalnum(c, loc))
    print(c, "alphanumeric", loc);
  else if (std::iscntrl(c, loc))
    print(c, "control", loc);
  else if (std::ispunct(c, loc))
    print(c, "punctuation", loc);
  else
    print(c, "none of the above", loc);
}
int main()
{
  // Test the same code point in different locales and character sets.
  char c{'\xd7'};
  // ISO 8859-1 is also called Latin-1 and is widely used in Western Europe
  // and the Americas. It is often the default character set in these regions.
  // The country and language are unimportant for this test.
  // Choose any that support the ISO 8859-1 character set.
  test(c, std::locale{"en_US.iso88591"});
  // ISO 8859-5 is Cyrillic. It is often the default character set in Russia
  // and some Eastern European countries. Choose any language and region that
  // support the ISO 8859-5 character set.
  test(c, std::locale{"ru_RU.iso88595"});
  // ISO 8859-7 is Greek. Choose any language and region that
  // support the ISO 8859-7 character set.
  test(c, std::locale{"el_GR.iso88597"});
  // ISO 8859-8 contains some Hebrew. The character set is no longer widely used.
  // Choose any language and region that support the ISO 8859-8 character set.
  test(c, std::locale{"he_IL.iso88598"});
}
Listing 58-2.

Exploring Character Sets and Locales

What do you get as the actual response?
  • _____________________________________________________________

  • _____________________________________________________________

  • _____________________________________________________________

  • _____________________________________________________________

In case you had trouble identifying locale names or other problems running the program, the following are the results I get when I run it on my system:
\xd7 is punctuation in en_US.iso88591
\xd7 is alphanumeric in ru_RU.iso88595
\xd7 is alphanumeric in el_GR.iso88597
\xd7 is none of the above in he_IL.iso88598

As you can see, the same character has different categories, depending on the locale’s character set. Now imagine that the user has entered a string, and your program has stored the string. If your program changes the global locale or the locale used to process that string, you may end up misinterpreting the string.

In Listing 58-2, the categorization functions reload their facets every time they are called, but you can rewrite the program so it loads its facet only once. The character type facet is called ctype. It has a function named is that takes a category mask and a character as arguments and returns a bool: true if the character has a type in the mask. The mask values are specified in std::ctype_base.

Note

Notice the convention that the standard library uses throughout. When a class template needs helper types and constants, they are declared in a non-template base class. The class template derives from the base class and so gains easy access to the types and constants. Callers gain access to the types and constants by qualifying with the base class name. By avoiding the template in the base class, the standard library avoids unnecessary instantiations just to use a type or constant that is unrelated to the template argument.

The mask names are the same as the categorization functions, but without the leading is. Listing 58-3 shows how to rewrite the simple character-set demonstration to use a single cached ctype facet.
import <format>;
import <iostream>;
import <locale>;
import <string>;
import ioflags;  // from Listing 40-4
void print(int c, std::string const& name, std::locale loc)
{
  // Don't concern yourself with the & operator. I'll cover that later
  // in the book. Its purpose is just to ensure the character's escape
  // code is printed correctly.
  std::cout << std::format("\\x{:02x} is {} in {}\n", c & 0xff, name, loc.name());
}
/// Test a character's categorization in the locale, @p loc.
void test(char c, std::locale loc)
{
  ioflags save{std::cout};
  std::ctype<char> const& ctype{std::use_facet<std::ctype<char>>(loc)};
  if (ctype.is(std::ctype_base::alnum, c))
    print(c, "alphanumeric", loc);
  else if (ctype.is(std::ctype_base::cntrl, c))
    print(c, "control", loc);
  else if (ctype.is(std::ctype_base::punct, c))
    print(c, "punctuation", loc);
  else
    print(c, "none of the above", loc);
}
int main()
{
  // Test the same code point in different locales and character sets.
  char c{'\xd7'};
  // ISO 8859-1 is also called Latin-1 and is widely used in Western Europe
  // and the Americas. It is often the default character set in these regions.
  // The country and language are unimportant for this test.
  // Choose any that support the ISO 8859-1 character set.
  test(c, std::locale{"en_US.iso88591"});
  // ISO 8859-5 is Cyrillic. It is often the default character set in Russia
  // and some Eastern European countries. Choose any language and region that
  // support the ISO 8859-5 character set.
  test(c, std::locale{"ru_RU.iso88595"});
  // ISO 8859-7 is Greek. Choose any language and region that
  // support the ISO 8859-7 character set.
  test(c, std::locale{"el_GR.iso88597"});
  // ISO 8859-8 contains some Hebrew. It is no longer widely used.
  // Choose any language and region that support the ISO 8859-8 character set.
  test(c, std::locale{"he_IL.iso88598"});
}
Listing 58-3.

Caching the ctype Facet
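Because a mask is a bitmask, you can combine categories with the | operator and test for any of them in a single call to is. The following sketch counts matching characters with one facet lookup; the function name count_matching is hypothetical.

```cpp
#include <cstddef>
#include <locale>
#include <string_view>

// Hypothetical helper: count the characters in text that belong to any
// category in mask, looking up the ctype facet only once.
std::size_t count_matching(std::string_view text, std::ctype_base::mask mask,
                           std::locale const& loc)
{
  auto const& ctype{ std::use_facet<std::ctype<char>>(loc) };
  std::size_t count{0};
  for (char ch : text)
    if (ctype.is(mask, ch))
      ++count;
  return count;
}
```

For example, passing std::ctype_base::alpha | std::ctype_base::digit counts letters and digits together, which for the text "a1 b2, c3!" in the classic locale gives six.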

The ctype facet also performs case conversions with the toupper and tolower member functions, which take a single character argument and return a character result. Recall the word-counting problem from Exploration 22. Rewrite your solution (see Listings 23-2 and 23-3) and change the sanitize() function to use a cached facet. I recommend replacing the function with a sanitizer class so the class can store the facet in a data member. Compare your program with Listing 58-4.
import <format>;
import <iostream>;
import <locale>;
import <map>;
import <ranges>;
import <string>;
import <string_view>;
using count_map  = std::map<std::string, int>;  ///< Map words to counts
using count_pair = count_map::value_type;       ///< pair of a word and a count
using str_size   = std::string::size_type;      ///< String size type
void initialize_streams()
{
  std::cin.imbue(std::locale{});
  std::cout.imbue(std::locale{});
}
class sanitizer
{
public:
  sanitizer(std::locale const& locale)
  : locale_{ locale },
    ctype_{ std::use_facet<std::ctype<char>>(locale_) }
  {}
  bool keep(char ch) const { return ctype_.is(ctype_.alnum, ch); }
  char tolower(char ch) const { return ctype_.tolower(ch); }
  std::string operator()(std::string_view str)
  const
  {
    auto data{ str
      | std::ranges::views::filter([this](char ch) { return keep(ch); })
      | std::ranges::views::transform([this](char ch) { return tolower(ch); })  };
    return std::string{ std::ranges::begin(data), std::ranges::end(data) };
  }
private:
    // Keep a copy of the locale. The facet reference is guaranteed to
    // remain valid only as long as some copy of the locale exists.
    std::locale locale_;
    std::ctype<char> const& ctype_;
};
str_size get_longest_key(count_map const& map)
{
  str_size result{0};
  for (auto const& pair : map)
    if (pair.first.size() > result)
      result = pair.first.size();
  return result;
}
void print_pair(count_pair const& pair, str_size longest)
{
  int constexpr count_size{10}; // Number of places for printing the count
  std::cout << std::format("{0:{1}} {2:{3}}\n", pair.first, longest, pair.second, count_size);
}
void print_counts(count_map const& counts)
{
  auto longest{get_longest_key(counts)};
  // For each word/count pair...
  for (count_pair const& pair : counts)
    print_pair(pair, longest);
}
int main()
{
  // Set the global locale to the native locale.
  std::locale::global(std::locale{""});
  initialize_streams();
  count_map counts{};
  sanitizer sanitize{std::locale{""}};
  // Read words from the standard input and count the number of times
  // each word occurs.
  std::string word{};
  while (std::cin >> word)
  {
    std::string copy{sanitize(word)};
    // The "word" might be all punctuation, so the copy would be empty.
    // Don't count empty strings.
    if (not copy.empty())
      ++counts[copy];
  }
  print_counts(counts);
}
Listing 58-4.

Counting Words Again, This Time with Cached Facets

Notice how most of the program is unchanged. The simple act of caching the ctype facet reduces this program’s runtime by about 15% on my system.
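The ctype facet also offers overloads of toupper and tolower that convert an entire range of characters in place, which can save even the per-character function calls. Here is a minimal sketch; the function name lowercase is hypothetical.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: return a lowercase copy of text, according to loc.
std::string lowercase(std::string text, std::locale const& loc)
{
  auto const& ctype{ std::use_facet<std::ctype<char>>(loc) };
  // The two-pointer overload converts the range [first, last) in place.
  ctype.tolower(text.data(), text.data() + text.size());
  return text;
}
```

The function takes its argument by value, so the caller's string is untouched and the conversion works on the local copy.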

Collation Order

You can use the relational operators (such as <) with characters and strings, but they don’t actually compare characters or code points; they compare storage units. Most users don’t care whether a list of names is sorted in ascending numerical order by storage unit. They want a list of names sorted in ascending alphabetical order, according to their native collation rules.

For example, which comes first: ångstrom or angle? The answer depends on where you live and what language you speak. In Scandinavia, angle comes first, and ångstrom follows zebra. The collate facet compares strings according to the locale’s rules. Its compare function is somewhat clumsy to use, so the locale class provides a simple interface for determining whether one string is less than another in a locale: use the locale’s function call operator. In other words, you can use a locale object itself as the comparison functor for standard algorithms, such as sort. Listing 58-5 shows a program that demonstrates how collation order depends on locale. In order to get the program to run in your environment, you may have to change the locale names.
import <algorithm>;
import <iostream>;
import <iterator>;
import <locale>;
import <string>;
import <vector>;
void sort_words(std::vector<std::string> words, std::locale loc)
{
  std::ranges::sort(words, loc);
  std::cout << loc.name() << ":\n";
  std::ranges::copy(words,
            std::ostream_iterator<std::string>(std::cout, "\n"));
}
int main()
{
  std::vector<std::string> words{
    "circus",
    "\u00e5ngstrom",    // ångstrom
    "\u00e7irc\u00ea",  // çircê
    "angle",
    "essen",
    "ether",
    "\u00e6ther",       // æther
    "aether",
    "e\u00dfen"         // eßen
  };
  sort_words(words, std::locale::classic());
  sort_words(words, std::locale{"en_GB.utf8"});  // Great Britain
  sort_words(words, std::locale{"no_NO.utf8"});  // Norway
}
Listing 58-5.

Demonstrating How Collation Order Depends on Locale

The \uNNNN escapes are a portable way to express Unicode characters. The NNNN must be four hexadecimal digits, specifying a Unicode code point. You will learn more in the next Exploration.

The call to std::ranges::sort in sort_words shows how the locale object is used as a comparison functor to sort the words. Table 58-3 lists the results I get for each locale. Depending on your native character set, you may get different results.
Table 58-3. Collation Order for Various Locales

Classic     Great Britain   Norway
aether      aether          aether
angle       æther           angle
circus      angle           çircê
essen       ångstrom        circus
ether       çircê           essen
eßen        circus          eßen
ångstrom    essen           ether
æther       eßen            æther
çircê       ether           ångstrom
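If you are curious about the collate facet that the locale's function call operator uses behind the scenes, the following sketch calls its compare function directly. You can see why I called it clumsy: it takes four pointers that delimit the two character ranges. The function name collate_compare is hypothetical.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: three-way comparison of two strings using a
// locale's collate facet. The result is negative, zero, or positive
// if a collates before, equal to, or after b.
int collate_compare(std::string const& a, std::string const& b,
                    std::locale const& loc)
{
  auto const& coll{ std::use_facet<std::collate<char>>(loc) };
  return coll.compare(a.data(), a.data() + a.size(),
                      b.data(), b.data() + b.size());
}
```

When all you need is a less-than test, loc(a, b) with two std::string arguments is the shorter route. The three-way result from compare is handy when you also have to tell equal strings apart from greater ones.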

The next Exploration takes a closer look at Unicode, international character sets, and related challenges.
