Chapter 14. Internationalization

As the global market has increased in importance, so has internationalization (or i18n for short)[1] become more important for software development. As a consequence, the C++ standard library provides concepts to write code for international programs. These concepts influence mainly the use of I/O and string processing. This chapter describes these concepts. Many thanks to Dietmar Kühl, who is an expert on I/O and internationalization in the C++ standard library and wrote major parts of this chapter.

The C++ standard library provides a general approach to support national conventions without being bound to specific conventions. This goes to the extent, for example, that strings are not bound to a specific character type to support 16-bit characters in Asia. For the internationalization of programs, two related aspects are important:

  1. Different character sets have different properties. Handling them requires flexible solutions for problems, such as what is considered to be a letter or, worse, what type to use to represent characters. For character sets with more than 256 characters, type char is not sufficient as a representation.

  2. The user of a program expects to see national or cultural conventions obeyed (for example, the formatting of dates, monetary values, numbers, and Boolean values).

For both aspects, the C++ standard library provides related solutions.

The major approach toward internationalization is to use locale objects to represent an extensible collection of aspects to be adapted to specific local conventions. Locales are already used in C to adapt to specific local conventions. In the C++ standard, this mechanism was generalized and made more flexible. Actually, the C++ locale mechanism can be used to address all kinds of customization, depending on the user's environment or preferences. For example, it can be extended to deal with measurement systems, time zones, or paper size.

Most of the mechanisms of internationalization involve no or only minimal additional work for the programmer. For example, when doing I/O with the C++ stream mechanism, numeric values are formatted according to the rules of some locale. The only work for the programmer is to instruct the I/O stream classes to use the user's preferences.

In addition to such automatic use, the programmer may use locale objects directly for formatting, collation, character classification, and so on. Some internationalized aspects supported by the C++ standard library are not used by the C++ standard library itself, and to use them the programmer has to call those functions manually. For example, there are no stream functions defined in the C++ standard library that do time, date, or monetary formatting. To use these services, it is necessary to call them directly (for example, in user-defined stream operators writing objects of a money class).

Strings and streams use another concept for internationalization: character traits. They define fundamental properties and operations that differ for different character sets, such as the value of "end-of-file" as well as functions to compare, assign, and copy strings.

The classes for internationalization were introduced to the standard relatively late. Although the general approach is extremely flexible, it still needs some work to make it really complete. For example, the functions for string collation (that is, comparing strings for sorting according to some locale conventions) use only iterators of type const charT*, where charT is some character type. Although it is very likely that basic_string<charT> uses this type as an iterator type, it is not at all guaranteed. Thus, it is not guaranteed that string iterators can be used as arguments to the functions for string collation. However, it is possible to use the result of basic_string data() member functions with the string collation functions.
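The workaround mentioned above can be sketched as follows. This is only a minimal illustration: the helper name collateCompare is hypothetical, and it uses the collate facet and use_facet<>(), which are introduced later in this chapter.

```cpp
#include <locale>
#include <string>

// Compares two strings according to the collation rules of loc.
// Returns a negative value, zero, or a positive value.
// (collateCompare is a hypothetical helper, not part of the library.)
int collateCompare(const std::locale& loc,
                   const std::string& s1, const std::string& s2)
{
    const std::collate<char>& coll
        = std::use_facet<std::collate<char> >(loc);

    // data() yields const char*, which is the iterator type that
    // collate<>::compare() expects
    return coll.compare(s1.data(), s1.data() + s1.size(),
                        s2.data(), s2.data() + s2.size());
}
```

Passing s1.begin() and s1.end() instead would compile only if the string iterators happen to be plain pointers, which is exactly what is not guaranteed.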

Different Character Encodings

One area internationalization addresses is how to handle different character encodings. This issue arises mainly in Asia, where different encodings are used to represent the same character set. The issue normally comes in conjunction with character encodings that use more than 8 bits. To process such characters, it is necessary to use new concepts and functions for text processing.

Wide-Character and Multibyte Text

Two different approaches are common to address character sets that have more than 256 characters: multibyte representation and wide-character representation:

  1. With multibyte representation, the number of bytes used for a character is variable. A 1-byte character, such as an ISO Latin-1 character, can be followed by a 3-byte character, such as a Japanese ideogram.

  2. With wide-character representation, the number of bytes used to represent a character is always the same, independent of the character being represented. Typical representations use 2 or 4 bytes. Conceptually, this does not differ from representations that use just 1 byte for locales, where ISO Latin-1 or even ASCII is sufficient.

The multibyte representation is more compact than the wide-character representation. Thus, the multibyte representation is normally used to store data outside of programs. Conversely, it is much easier to process characters of fixed size, so the wide-character representation is usually used inside programs.

Like ISO C, ISO C++ uses the type wchar_t to represent wide characters. In C++, however, wchar_t is a keyword rather than a type definition. Thus, it is possible to overload all functions with this type.
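A small sketch may illustrate the consequence. The function name describe is hypothetical; the point is only that the two overloads can coexist because wchar_t is a distinct type:

```cpp
#include <string>

// Two overloads that differ only in the character type.
// If wchar_t were merely a typedef for some integral type, the second
// overload could clash with an overload for that integral type.
std::string describe(char)    { return "narrow character"; }
std::string describe(wchar_t) { return "wide character"; }
```

A call such as describe('a') selects the first overload, and describe(L'a') selects the second.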

In a multibyte string, the same byte may represent a character or even just a part of the character. During iteration through a multibyte string, each byte is interpreted according to a current "shift state." Depending on the value of the byte and the current shift state, a byte may represent a certain character or a change of the current shift state. A multibyte string always starts in some defined initial shift state. For example, in the initial shift state the bytes may represent ISO Latin-1 characters until an escape character is encountered. The character following the escape character identifies the new shift state. For example, that character may switch to a shift state in which the bytes are interpreted as Arabic characters until the next escape character is encountered.

The class template codecvt<> (described in Section 14.4.4) is used to convert between different character encodings. This class is used mainly by the class basic_filebuf<> (see page 627) to convert between internal and external representations. The C++ standard actually makes no assumptions about multibyte character encodings, but it supports the notion of shift states. The members of the codecvt<> class support an argument that may be used to store an arbitrary state of a string. They also support a function intended to determine the character sequence used to return to the initial shift state.
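As a rough sketch of how the state argument appears in practice, the following hypothetical helper toWide() converts an external char sequence into wide characters with the member function in() of the codecvt facet. It assumes that the conversion never produces more wide characters than there are input bytes, and it simply gives up on any result other than ok:

```cpp
#include <cwchar>    // std::mbstate_t
#include <locale>
#include <string>

// Converts an external (narrow or multibyte) sequence into wide
// characters by using the codecvt facet of loc.
// Returns an empty string if the conversion does not fully succeed.
// (toWide is a hypothetical helper, not part of the library.)
std::wstring toWide(const std::string& ext, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Converter;
    const Converter& cvt = std::use_facet<Converter>(loc);

    if (ext.empty()) {
        return std::wstring();
    }

    std::mbstate_t state = std::mbstate_t();   // initial shift state
    // assumption: at most one wide character per input byte
    std::wstring result(ext.size(), L'\0');

    const char* fromNext = 0;
    wchar_t*    toNext = 0;
    std::codecvt_base::result res =
        cvt.in(state,
               ext.data(), ext.data() + ext.size(), fromNext,
               &result[0], &result[0] + result.size(), toNext);

    if (res != std::codecvt_base::ok) {
        return std::wstring();
    }
    result.resize(toNext - &result[0]);        // keep only what was written
    return result;
}
```

For the classic locale and plain ASCII input, each byte is simply mapped to the corresponding wide character.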

Character Traits

The different representations of character sets imply variations that are relevant for the processing of strings and I/O. For example, the value used to represent "end-of-file" or the details of comparing characters may differ for representations.

The string and stream classes are intended to be instantiated with built-in types, especially with char and wchar_t. The interface of built-in types cannot be changed. Thus, the details on how to deal with aspects that depend on the representation are factored into a separate class, a so-called "traits class." Both the string and stream classes take a traits class as a template argument. This argument defaults to the class char_traits, parameterized with the template argument that defines the character type of the string or stream:

    namespace std {
        template<class charT,
                 class traits = char_traits<charT>,
                 class Allocator = allocator<charT> >
        class basic_string;
    }
    namespace std {
        template <class charT,
                  class traits = char_traits<charT> >
        class basic_istream;
        template <class charT,
                  class traits = char_traits<charT> >
        class basic_ostream;
        ...
    }

The character traits have type char_traits<>. This type is defined in <string> and is parameterized for the specific character type:

    namespace std {
        template <class charT>
        struct char_traits {
            ...
        };
    }

The traits classes define all fundamental properties of the character type and the corresponding operations necessary for the implementation of strings and streams as static components. Table 14.1 lists the members of char_traits.

The functions that process strings or character sequences are present for optimization only. They could also be implemented by using the functions that process single characters. For example, copy() can be implemented using assign(). However, there might be more efficient implementations when dealing with strings.

Note that counts used in the functions are exact counts, not maximum counts. That is, string termination characters within these sequences are ignored.
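The difference from the C string functions can be seen in a short sketch. The names seqA, seqB, and the two helper functions are hypothetical and serve only as illustration:

```cpp
#include <cstring>
#include <string>

// Both sequences start with "ab" followed by an embedded '\0'
const char seqA[] = { 'a', 'b', '\0', 'x' };
const char seqB[] = { 'a', 'b', '\0', 'y' };

// strcmp() stops at the first '\0' and therefore reports equality
bool equalUpToTerminator()
{
    return std::strcmp(seqA, seqB) == 0;
}

// char_traits<>::compare() looks at exactly n characters and
// therefore also sees the difference behind the '\0'
bool differOverFourCharacters()
{
    return std::char_traits<char>::compare(seqA, seqB, 4) != 0;
}
```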

The last group of functions cares about the special processing of the character that represents end-of-file (EOF). This character extends the character set by an artificial character to indicate special processing. For some representations, the character type may be insufficient to accommodate this special character because it has to have a value that differs from the values of all "normal" characters of the character set. C established the convention to return a character as int instead of as char from functions reading characters. This technique was extended in C++. The character traits define char_type as the type to represent all characters, and int_type as the type to represent all characters plus EOF. The functions to_char_type(), to_int_type(), not_eof(), and eq_int_type() define the corresponding conversions and comparisons. It is possible that char_type and int_type are identical for some character traits. This can be the case if not all values of char_type are necessary to represent characters so that there is a spare value that can be used for end-of-file.
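The following sketch shows how these functions fit together for char_traits<char>. The typedef Traits and the helpers isRealChar() and roundTrip() are hypothetical names chosen for this illustration:

```cpp
#include <string>

typedef std::char_traits<char> Traits;

// Returns true if the int_type value i represents a real character
// rather than end-of-file
bool isRealChar(Traits::int_type i)
{
    return !Traits::eq_int_type(i, Traits::eof());
}

// Round trip: char_type -> int_type -> char_type
char roundTrip(char c)
{
    return Traits::to_char_type(Traits::to_int_type(c));
}
```

Because to_int_type() maps every character to a value that differs from eof(), even a character with all bits set is not mistaken for end-of-file.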

pos_type and off_type are used to define file positions and offsets (see page 634 for details).

Table 14.1. Character Traits Members

Expression          Meaning
char_type           The character type (that is, the template argument for char_traits)
int_type            A type large enough to represent an additional, otherwise unused value for end-of-file
pos_type            A type used to represent positions in streams
off_type            A type used to represent offsets between positions in streams
state_type          A type used to represent the current state in multibyte streams
assign(c1,c2)       Assigns character c2 to c1
eq(c1,c2)           Returns whether the characters c1 and c2 are equal
lt(c1,c2)           Returns whether character c1 is less than character c2
length(s)           Returns the length of the string s
compare(s1,s2,n)    Compares up to n characters of strings s1 and s2
copy(s1,s2,n)       Copies n characters of string s2 to string s1
move(s1,s2,n)       Copies n characters of string s2 to string s1, where s1 and s2 may overlap
assign(s,n,c)       Assigns the character c to n characters of string s
find(s,n,c)         Returns a pointer to the first character in string s that is equal to c, or returns zero if there is no such character among the first n characters
eof()               Returns the value of end-of-file
to_int_type(c)      Converts the character c into the corresponding representation as int_type
to_char_type(i)     Converts the representation i as int_type to a character (the result of converting EOF is undefined)
not_eof(i)          Returns the value i unless i is the value for EOF; in this case an implementation-dependent value different from EOF is returned
eq_int_type(i1,i2)  Tests the equality of the two characters i1 and i2 represented as int_type (that is, the characters may be EOF)

The C++ standard library provides specializations of char_traits<> for types char and wchar_t:

    namespace std {
        template<> struct char_traits<char>;
        template<> struct char_traits<wchar_t>;
    }

The specialization for char is usually implemented by using the global string functions of C that are defined in <cstring> or <string.h>. An implementation might look as follows:

    namespace std {
      template<> struct char_traits<char> {
        //type definitions:
        typedef char      char_type;
        typedef int       int_type;
        typedef streampos pos_type;
        typedef streamoff off_type;
        typedef mbstate_t state_type;


        //functions:
        static void assign(char& c1, const char& c2) {
            c1 = c2;
        }
        static bool eq(const char& c1, const char& c2) {
            return c1 == c2;
        }
        static bool lt(const char& c1, const char& c2) {
            return c1 < c2;
        }
        static size_t length(const char* s) {
            return strlen(s);
        }
        static int compare(const char* s1, const char* s2, size_t n) {
            return memcmp(s1,s2,n);
        }
        static char* copy(char* s1, const char* s2, size_t n) {
            return (char*)memcpy(s1,s2,n);
        }
        static char* move(char* s1, const char* s2, size_t n) {
            return (char*)memmove(s1,s2,n);
        }
        static char* assign(char* s, size_t n, char c) {
            return (char*)memset(s,c,n);
        }
        static const char* find(const char* s, size_t n,
                                const char& c) {
            return (const char*)memchr(s,c,n);
        }
        static int eof() {
            return EOF;
        }
        static int to_int_type(const char& c) {
            return (int)(unsigned char)c;
        }
        static char to_char_type(const int& i) {
            return (char)i;
        }
        static int not_eof(const int& i) {
            return i!=EOF ? i : !EOF;
        }
        static bool eq_int_type(const int& i1, const int& i2) {
            return i1 == i2;
        }
      };
    }

See Section 11.2.14 for the implementation of a user-defined traits class that lets strings behave in a case-insensitive manner.

Internationalization of Special Characters

One issue in conjunction with character encodings remains: How are special characters such as the newline or the string termination character internationalized? The class basic_ios has members widen() and narrow() that can be used for this purpose. Thus, the newline character in an encoding appropriate for the stream strm can be written as follows:

   strm.widen('\n')     // internationalized newline character

The string termination character in the same encoding can be created like this:

   strm.widen('\0')     // internationalized string termination character

See the implementation of the endl manipulator on page 613 for an example use.

The functions widen() and narrow() actually use a locale object, more precisely the ctype facet of this object. This facet can be used to convert all characters between char and some other character representations. It is described in Section 14.4.4. For example, the following expression converts the character c of type char into an object of type char_type by using the locale object loc[2]:

    std::use_facet<std::ctype<char_type> >(loc).widen(c)

The details of the use of locales and their facets are described in the following sections.
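Before going into these details, here is a minimal sketch of both ways to widen a character, once via the facet and once via the stream. The helper names widenChar() and streamNewline() are hypothetical:

```cpp
#include <locale>
#include <ostream>
#include <sstream>

// Widens a char to wchar_t by using the ctype facet of loc
wchar_t widenChar(char c, const std::locale& loc)
{
    return std::use_facet<std::ctype<wchar_t> >(loc).widen(c);
}

// Asks a wide stream for the newline character in its own encoding;
// the stream uses the ctype facet of the locale it stores
wchar_t streamNewline(const std::wostream& strm)
{
    return strm.widen('\n');
}
```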

The Concept of Locales

A common approach to internationalization is to use environments, called locales, to encapsulate national or cultural conventions. The C community uses this approach. Thus, in the context of internationalization, a locale is a collection of parameters and functions used to support national or cultural conventions. According to X/Open conventions,[3] the environment variable LANG is used to define the locale to be used. Depending on this locale, different formats for floating-point numbers, dates, monetary values, and so on are used.

The format of the string defining a locale is normally this:

language [_area [.code]]

Here, language represents the language, such as English or German, and area is the area, country, or culture where this language is used. The area is used, for example, to support different national conventions even if the same language is used in different nations. code defines the character encoding to be used. This is mainly important in Asia, where different character encodings are used to represent the same character set.
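To make the format concrete, the following sketch splits such a name into its three parts. The type LocaleNameParts and the function splitLocaleName() are hypothetical helpers; the C++ standard library offers no such function and does not interpret locale names in this way itself. For simplicity, the sketch also accepts a code part without an area part.

```cpp
#include <string>

// Hypothetical helper type holding the parts of a locale name
struct LocaleNameParts {
    std::string language;
    std::string area;
    std::string code;
};

// Splits a name of the form language[_area[.code]] into its parts.
// Missing parts are returned as empty strings.
LocaleNameParts splitLocaleName(const std::string& name)
{
    const std::string::size_type npos = std::string::npos;
    LocaleNameParts parts;

    const std::string::size_type dot = name.find('.');
    std::string::size_type underscore = name.find('_');

    // an underscore behind the dot belongs to the encoding name
    if (dot != npos && underscore != npos && underscore > dot) {
        underscore = npos;
    }

    // the language ends at the underscore, at the dot, or at the end
    parts.language = name.substr(0, underscore != npos ? underscore : dot);

    if (underscore != npos) {
        parts.area = name.substr(underscore + 1,
                                 dot != npos ? dot - underscore - 1 : npos);
    }
    if (dot != npos) {
        parts.code = name.substr(dot + 1);
    }
    return parts;
}
```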

Table 14.2 presents a selection of typical language strings. However, note that these strings are not yet standardized. For example, sometimes the first character of language is capitalized. Some implementations deviate from the format mentioned previously and, for example, use english to select an English locale. All in all, the locales that are supported by a system are implementation specific.

For programs, it is normally no problem that these names are not standardized! This is because the locale information is provided by the user in some form. It is common that programs simply read environment variables or some similar database to determine which locales to use. Thus, the burden of finding the correct locale names is put on the users. Only if the program always uses a special locale does the name need to be hard coded in the program. Normally, for this case, the C locale is sufficient, and is guaranteed to be supported by all implementations and to have the name C.

The next section presents the use of different locales in C++ programs. In particular, it introduces facets of locales that are used to deal with specific formatting details.

C also provides an approach to handle the problem of character sets with more than 256 characters. This approach is to use the character type wchar_t, a type definition for one of the integral types with language support for wide-character constants and wide-character string literals. However, apart from this, only functions to convert between wide characters and narrow characters are supported. This approach was also incorporated into C++ with the character type wchar_t, which is, unlike the C approach, a distinct type in C++. However, C++ provides more library support than C, because basically everything available for char is also available for wchar_t, and any other type may be used as a character type.

Table 14.2. Selection of Locale Names

Locale        Meaning
C             Default: ANSI-C conventions (English, 7 bit)
de_DE         German in Germany
de_DE.88591   German in Germany with ISO Latin-1 encoding
de_AT         German in Austria
de_CH         German in Switzerland
en_US         English in the United States
en_GB         English in Great Britain
en_AU         English in Australia
en_CA         English in Canada
fr_FR         French in France
fr_CH         French in Switzerland
fr_CA         French in Canada
ja_JP.jis     Japanese in Japan with Japanese Industrial Standard (JIS) encoding
ja_JP.sjis    Japanese in Japan with Shift JIS encoding
ja_JP.ujis    Japanese in Japan with UNIXized JIS encoding
ja_JP.EUC     Japanese in Japan with Extended UNIX Code encoding
ko_KR         Korean in Korea
zh_CN         Chinese in China
zh_TW         Chinese in Taiwan
lt_LN.bit7    ISO Latin, 7 bit
lt_LN.bit8    ISO Latin, 8 bit
POSIX         POSIX conventions (English, 7 bit)

Using Locales

Using translations of textual messages is normally not sufficient for true internationalization. For example, different conventions for numeric, monetary, or date formatting also have to be used. In addition, functions manipulating letters should depend on character encoding to ensure the correct handling of all characters that are letters in a given language.

According to the POSIX and X/Open standards, it is already possible in C programs to set a locale. This is done using the function setlocale(). Changing the locale influences the results of character classification and manipulation functions, such as isupper() and toupper(), and the I/O functions, such as printf().

However, the C approach has several limitations. Because the locale is a global property, using more than one locale at the same time (for example, when reading floating-point numbers in English and writing them in German) is either not possible or is possible only with a relatively large effort. Also, locales cannot be extended. They provide only the facilities the implementation chooses to provide. If something the C locales do not provide must also be adapted to national conventions, a different mechanism has to be used to do this. Finally, it is not possible to define new locales to support special cultural conventions.

The C++ standard library addresses all of these problems with an object-oriented approach. First, the details of a locale are encapsulated in an object of type locale. Doing this immediately provides the possibility of using multiple locales at the same time. Operations that depend on locales are configured to use a corresponding locale object. For example, a locale object can be installed for each I/O stream, which is then used by the different member functions to adapt to the corresponding conventions. This is demonstrated by the following example:

// i18n/loc1.cpp

   #include <iostream>
   #include <locale>
   using namespace std;

   int main()
   {
       // use classic C locale to read data from standard input
       cin.imbue(locale::classic());

       // use a German locale to write data to standard output
       cout.imbue(locale("de_DE"));

       // read and output floating-point values in a loop

       double value;
       while (cin >> value) {
           cout << value << endl;
       }
   }

The statement

    cin.imbue(locale::classic());

assigns the "classic" C locale to the standard input channel. For the classic C locale, formatting of numbers and dates, character classification, and so on is handled as it is in original C without any locales. The expression

    std::locale::classic()

obtains a corresponding object of class locale. Using the expression

    std::locale("C")

instead would yield the same result. This last expression constructs a locale object from a given name. The name "C" is a special name, and actually is the only one a C++ implementation is required to support. There is no requirement to support any other locale, although it is assumed that C++ implementations also support other locales.

Correspondingly, the statement

    cout.imbue(locale("de_DE"));

assigns the locale de_DE to the standard output channel. This is, of course, successful only if the system supports this locale. If the name used to construct a locale object is unknown to the implementation, an exception of type runtime_error is thrown.
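A program that should keep running when the requested locale is not available can catch this exception. The following is one possible sketch; the function name localeOrClassic() and the choice of the classic locale as the fallback are assumptions made for this illustration:

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Tries to construct the locale with the passed name and falls back
// to the classic C locale if the implementation does not know it.
// (localeOrClassic is a hypothetical helper.)
std::locale localeOrClassic(const std::string& name)
{
    try {
        return std::locale(name.c_str());
    }
    catch (const std::runtime_error&) {
        // unknown locale name: use the one locale that always exists
        return std::locale::classic();
    }
}
```

A call such as cout.imbue(localeOrClassic("de_DE")) then uses the German conventions where they are supported and the classic conventions elsewhere.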

If everything was successful, input is read according to the classic C conventions and output is written according to the German conventions. The loop thus reads floating-point values in the normal English format, for example

   47.11

and prints them using the German format, for example

   47,11

Yes, the Germans really use a comma as a "decimal point".

Normally, a program does not predefine a specific locale except when writing and reading data in a fixed format. Instead, the locale is determined using the environment variable LANG. Another possibility is to read the name of the locale to be used. The following program demonstrates this:

// i18n/loc2.cpp

   #include <iostream>
   #include <locale>
   #include <string>
   #include <cstdlib>
   using namespace std;

   int main()
   {
       //create the default locale from the user's environment
       locale langLocale("");


       //and assign it to the standard output channel
       cout.imbue(langLocale);


       //process the name of the locale
       bool isGerman;
       if (langLocale.name() == "de_DE" ||
           langLocale.name() == "de" ||
           langLocale.name() == "german") {
             isGerman = true;
       }
       else {
             isGerman = false;
       }


       //read locale for the input
       if (isGerman) {
           cout << "Sprachumgebung fuer Eingaben: ";
       }
       else {
           cout << "Locale for input: ";
       }
       string s;
       cin >> s;
       if (!cin) {
           if (isGerman) {
               cerr << "FEHLER beim Einlesen der Sprachumgebung"
                    << endl;
           }
           else {
               cerr << "ERROR while reading the locale" << endl;
           }
           return EXIT_FAILURE;
       }
       locale cinLocale(s.c_str());


       //and assign it to the standard input channel
       cin.imbue(cinLocale);

       //read and output floating-point values in a loop
       double value;
       while (cin >> value) {
           cout << value << endl;
       }
    }

In this example, the following statement creates an object of the class locale:

    locale langLocale("");

Passing an empty string as the name of the locale has a special meaning: The default locale from the user's environment is used (this is often determined by the environment variable LANG). This locale is assigned to the standard output stream with the statement

    cout.imbue(langLocale);

The expression

    langLocale.name()

is used to retrieve the name of the default locale, which is returned as an object of type string (see Chapter 11).

The following statements construct a locale from a name read from standard input:

   string s;
   cin >> s;
   ...
   locale cinLocale(s.c_str());

To do this, a word is read from the standard input and used as the constructor's argument. If the read fails, the ios_base::failbit is set in the input stream, which is checked and handled in this program:

   if (!cin) {
       if (isGerman) {
           cerr << "FEHLER beim Einlesen der Sprachumgebung"
                << endl;
       }
       else {
           cerr << "ERROR while reading the locale" << endl;
       }
       return EXIT_FAILURE;
   }

Again, if the string is not a valid value for the construction of a locale, a runtime_error exception is thrown.

If a program wants to honor local conventions, it should use corresponding locale objects. The static member function global() of the class locale can be used to install a global locale object. This object is used as the default value for functions that take an optional locale object as an argument. If the locale object set with the global() function has a name, it is also arranged that the C functions dealing with locales react correspondingly. If the locale set has no name, the consequences for the C functions depend on the implementation.

Here is an example of how to set the global locale object depending on the environment in which the program is running:

    /* create a locale object depending on the program's environment and
     * set it as the global object
     */
    std::locale::global(std::locale(""));

Among other things, this arranges for the corresponding registration for the C functions to be executed. That is, the C functions are influenced as if the following call was made:

    std::setlocale(LC_ALL,"")

However, setting the global locale does not replace locales already stored in objects. It only modifies the locale object copied when a locale is created with a default constructor. For example, the stream objects store locale objects that are not replaced by a call to locale::global(). If you want an existing stream to use a specific locale, you have to tell the stream to use this locale using the imbue() function.

The global locale is used if a locale object is created with the default constructor. In this case, the new locale behaves as if it is a copy of the global locale at the time it was constructed. The following three lines install the default locale for the standard streams:

    // register global locale object for streams
    std::cin.imbue(std::locale());
    std::cout.imbue(std::locale());
    std::cerr.imbue(std::locale());
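The fact that existing streams keep their locale can be demonstrated with a small sketch. To stay independent of the locales installed on a system, it uses a hypothetical facet CommaPunct (derived from numpunct<char>; facets are introduced in the next section) instead of a named German locale. It also assumes that the global locale is still the classic locale when the function is called.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet that uses a comma as the decimal point
class CommaPunct : public std::numpunct<char> {
  protected:
    virtual char do_decimal_point() const {
        return ',';
    }
};

// Formats value with a stream created before and a stream created
// after the global locale is changed; returns "before|after"
std::string beforeAndAfterGlobal(double value)
{
    std::ostringstream before;   // stores a copy of the current global locale

    // install a new global locale and remember the old one
    std::locale old = std::locale::global(
        std::locale(std::locale::classic(), new CommaPunct));

    std::ostringstream after;    // stores a copy of the new global locale

    before << value;             // still uses the old conventions
    after << value;              // uses the comma

    std::locale::global(old);    // restore the previous global locale
    return before.str() + "|" + after.str();
}
```

To let the stream before use the new conventions as well, it would have to be told so explicitly with before.imbue(std::locale()).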

When using locales in C++, it is important to remember that the C++ locale mechanism is only loosely coupled to the C locale mechanism. There is only one relation to the C locale mechanism: The global C locale is modified if a named C++ locale object is set as the global locale. In general, you should not assume that the C and the C++ functions operate on the same locales.

Locale Facets

The actual dependencies on national conventions are separated into several aspects that are handled by corresponding objects. An object dealing with a specific aspect of internationalization is called a facet. A locale object is used as a container of different facets. To access an aspect of a locale, the type of the corresponding facet is used as the index. The type of the facet is passed explicitly as a template argument to the template function use_facet(), accessing the desired facet. For example, the expression

    std::use_facet<std::numpunct<char> >(loc)

accesses the facet type numpunct for the character type char of the locale object loc. Each facet type is defined by a class that defines certain services. For example, the facet type numpunct provides services used in conjunction with the formatting of numeric and Boolean values. For example, the following expression returns the string used to represent true in the locale loc.

    std::use_facet<std::numpunct<char> >(loc).truename()
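Wrapped into small functions, such queries might look as follows. The function names decimalPointOf() and falseNameOf() are hypothetical; the member functions decimal_point() and falsename() belong to numpunct<>:

```cpp
#include <locale>
#include <string>

// Returns the character used as the decimal point in loc
char decimalPointOf(const std::locale& loc)
{
    return std::use_facet<std::numpunct<char> >(loc).decimal_point();
}

// Returns the string used to represent false in loc
std::string falseNameOf(const std::locale& loc)
{
    return std::use_facet<std::numpunct<char> >(loc).falsename();
}
```

For the classic locale, these functions yield '.' and "false".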

Table 14.3 provides an overview of the facets predefined by the C++ standard library. Each facet is associated with a category. These categories are used by some of the constructors of locales to create new locales as the combination of other locales.

Table 14.3. Facet Types Predefined by the C++ Standard Library

Category    Facet Type     Used for
numeric     num_get<>      Numeric input
            num_put<>      Numeric output
            numpunct<>     Symbols used for numeric I/O
time        time_get<>     Time and date input
            time_put<>     Time and date output
monetary    money_get<>    Monetary input
            money_put<>    Monetary output
            moneypunct<>   Symbols used for monetary I/O
ctype       ctype<>        Character information (toupper(), isupper())
            codecvt<>      Conversion between different character encodings
collate     collate<>      String collation
messages    messages<>     Message string retrieval

It is possible to define your own versions of the facets to create specialized locales. The following example demonstrates how this is done. It defines a facet using German representations of the Boolean values:

    class germanBoolNames : public std::numpunct_byname<char> {
      public:
        germanBoolNames (const char *name)
          : std::numpunct_byname<char>(name) {
        }
      protected:
        virtual std::string do_truename() const {
            return "wahr";
        }
        virtual std::string do_falsename() const {
            return "falsch";
        }
    };

The class germanBoolNames derives from the class numpunct_byname, which is defined by the C++ standard library. This class defines punctuation properties depending on the locale used for numeric formatting. Deriving from numpunct_byname instead of from numpunct means that the members not overridden explicitly still return values that depend on the name passed as the argument to the constructor. If the class numpunct had been used as the base class, the behavior of the other functions would be fixed. The class germanBoolNames itself overrides only the two functions used to determine the textual representation of true and false.

To use this facet in a locale, you need to create a new locale using a special constructor of the class locale. This constructor takes a locale object as its first argument and a pointer to a facet as its second argument. The created locale is identical to the first argument except for the facet that is passed as the second argument. This facet is installed in the newly created locale after the first argument is copied:

    std::locale loc (std::locale(""), new germanBoolNames(""));

The new expression creates a facet that is installed in the new locale. Thus, it is registered in loc to create a variation of locale(""). Since locales are immutable, you have to create a new locale object if you want to install a new facet to a locale. This locale object can be used like any other locale object. For example,

    std::cout.imbue(loc);
    std::cout << std::boolalpha << true << std::endl;

would have the following output:

    wahr

You also can create a completely new facet. In this case, the function has_facet() can be used to determine whether such a new facet is registered for a given locale object.

Locales in Detail

A C++ locale is an immutable container for facets. It is defined in the <locale> header file as follows:

    namespace std {
        class locale {
        public:
          // global locale objects
          static const locale& classic();            //classic C locale
          static       locale global(const locale&); //set global locale

          // internal types and values
          class facet;
          class id;
          typedef int category;
          static const category none, numeric, time, monetary,
                                ctype, collate, messages, all;

          // constructors
          locale() throw();
          explicit locale (const char* name);

          // create locale based on other locales
          locale (const locale& loc) throw();
          locale (const locale& loc, const char* name, category);
          template <class Facet>
            locale (const locale& loc, Facet* fp);
          locale (const locale& loc, const locale& loc2, category);

          // assignment operator
          const locale& operator= (const locale& loc) throw();

          // create a copy with the facet of type Facet taken from loc
          template <class Facet>
            locale combine (const locale& loc) const;

          // destructor
          ~locale() throw();

          //name (if any)
          basic_string<char> name() const;

          // comparisons
          bool operator== (const locale& loc) const;
          bool operator!= (const locale& loc) const;

          //sorting of strings
          template <class charT, class Traits, class Allocator>
            bool operator() (
              const basic_string<charT,Traits,Allocator>& s1,
              const basic_string<charT,Traits,Allocator>& s2) const;
        };

        //facet access
        template <class Facet>
          const Facet& use_facet (const locale&);
        template <class Facet>
          bool has_facet (const locale&) throw();
    }

The strange thing about locales is how the objects stored in the container are accessed. A facet in a locale is accessed using the type of the facet as the index. Because each facet exposes a different interface and suits a different purpose, it is desirable to have the access function to locales return a type corresponding to the index. This is exactly what can be done with a type as the index. Using the facet's type as an index has the additional advantage of having a type-safe interface.

Locales are immutable. This means the facets stored in a locale cannot be changed (except when locales are being assigned). Variations of locales are created by combining existing locales and facets to create a new locale. Table 14.4 lists the constructors for locales.

Table 14.4. Constructing Locales

Expression                        Effect
locale()                          Creates a copy of the current global locale
locale(name)                      Creates a locale from the string name
locale(loc)                       Creates a copy of locale loc
locale(loc1,loc2,cat)             Creates a copy of locale loc1, with all facets from category cat replaced with facets from locale loc2
locale(loc,name,cat)              Equivalent to locale(loc,locale(name),cat)
locale(loc,fp)                    Creates a copy of locale loc and installs the facet to which fp refers
loc1 = loc2                       Assigns locale loc2 to locale loc1
loc1.template combine<F>(loc2)    Creates a copy of locale loc1 but with the facet of type F taken from loc2
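
The following lines are a minimal sketch of two of these operations. The facet SemicolonPoint and the three helper functions are hypothetical names introduced only for this illustration; only the classic locale is used, so no named locale has to be available.

```cpp
#include <locale>
#include <string>

// Hypothetical facet for illustration: uses ';' as the decimal point.
class SemicolonPoint : public std::numpunct<char> {
  protected:
    virtual char do_decimal_point() const {
        return ';';
    }
};

// locale(loc,fp): copy of the classic locale with the new facet installed
std::locale makeModified()
{
    return std::locale(std::locale::classic(), new SemicolonPoint);
}

// combine<F>(): copy of the classic locale, but with the
// numpunct<char> facet taken from the locale other
std::locale takeNumpunctFrom(const std::locale& other)
{
    return std::locale::classic().combine<std::numpunct<char> >(other);
}

// returns the decimal point character defined by loc
char decimalPointOf(const std::locale& loc)
{
    return std::use_facet<std::numpunct<char> >(loc).decimal_point();
}
```

Because locales are immutable, neither helper changes the classic locale; each returns a new locale object.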

Almost all constructors create a copy of some other locale. Merely copying a locale is considered to be a cheap operation. Basically, it consists of setting a pointer and increasing a reference count. Creating a modified locale is more expensive. In this case, a reference count for each facet stored in the locale has to be adjusted. Although the standard makes no guarantees about such efficient behavior, it is likely that all implementations will be rather efficient for copying locales.

Two of the constructors listed in Table 14.4 take names of locales. The names accepted are not standardized, with the exception of the name "C". However, the standard requires that the documentation with the C++ standard library lists the accepted names. It is assumed that most implementations will accept names as outlined in Section 14.2.
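
Because the accepted names differ between platforms, a program may want to guard the construction of a named locale. The following sketch assumes that falling back to the classic locale is acceptable; the function name localeOrClassic is hypothetical. It relies on the fact that the locale constructor throws std::runtime_error for a name it does not accept.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Returns the locale with the passed name or, if the
// implementation does not accept the name, the classic locale.
std::locale localeOrClassic(const char* name)
{
    try {
        return std::locale(name);
    }
    catch (const std::runtime_error&) {
        // name not supported on this platform
        return std::locale::classic();
    }
}
```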

The member function combine() needs some explanation because it uses a feature that was implemented in compilers only recently. It is a member function template with an explicitly specified template argument. This means the template argument is not deduced implicitly from an argument because there is no argument from which the type can be deduced. Instead, the template argument is specified explicitly (type F in this case).

The two functions that access facets in a locale object use the same technique (Table 14.5). The major difference is that these two functions are global template functions, thereby making this ugly syntax involving the template keyword unnecessary.

The function use_facet() returns a reference to a facet. The type of this reference is the type passed explicitly as the template argument. If the locale passed as the argument does not contain a corresponding facet, the function throws a bad_cast exception. The function has_facet() can be used to test whether a particular facet is present in a given locale.

Table 14.5. Accessing Facets

Expression           Effect
has_facet<F>(loc)    Returns true if a facet of type F is stored in locale loc
use_facet<F>(loc)    Returns a reference to the facet of type F stored in locale loc
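
As a small sketch of both functions, the following hypothetical helper trueNameOf() checks for the facet before it accesses it. For a standard facet such as numpunct<char> the check always succeeds; it is shown only to illustrate the pattern.

```cpp
#include <locale>
#include <string>

// Returns the textual representation of true defined by loc,
// or an empty string if loc has no numpunct<char> facet.
std::string trueNameOf(const std::locale& loc)
{
    if (!std::has_facet<std::numpunct<char> >(loc)) {
        return std::string();
    }
    // the template argument selects the facet and the return type
    const std::numpunct<char>& np
     = std::use_facet<std::numpunct<char> >(loc);
    return np.truename();
}
```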

The remaining operations of locales are listed in Table 14.6. The name of a locale is maintained if the locale was constructed from a name or from one or more named locales. However, the standard makes no guarantees about how a name is constructed when two locales are combined. Two locales are considered to be identical if one is a copy of the other or if both locales have a name and the names are the same.

It is natural to consider two objects to be identical if one is a copy of the other. But why compare names? The idea is that the name of a locale reflects the names used to construct its named facets. For example, the locale's name might be constructed by joining the names of the facets in a particular order, separating the individual names by separation characters. Using this scheme, it would be possible to identify two locale objects as identical if they are constructed by combining the same named facets. In other words, the standard basically requires that two locales consisting of the same set of named facets be considered identical. Thus, the names will probably be constructed carefully to support this notion of equality.

Table 14.6. Operations of Locales

Expression             Effect
loc.name()             Returns the name of locale loc as string
loc1 == loc2           Returns true if loc1 and loc2 are identical locales
loc1 != loc2           Returns true if loc1 and loc2 are different locales
loc(str1,str2)         Returns the Boolean result of comparing strings str1 and str2 for ordering (whether str1 is less than str2)
locale::classic()      Returns locale("C")
locale::global(loc)    Installs loc as the global locale and returns the previous global locale
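
The following sketch exercises name() and the comparison operators. PlainPunct and the helper functions are hypothetical names. The sketch relies on two facts: a locale created by installing a facet has no name, and name() returns the string "*" for a locale without a name.

```cpp
#include <locale>
#include <string>

// Hypothetical facet that changes nothing; installing it
// nevertheless yields a locale without a name.
class PlainPunct : public std::numpunct<char> {
};

// a copy of a locale is identical to the original
bool copyIsIdentical()
{
    std::locale a = std::locale::classic();
    std::locale b(a);
    return a == b;
}

// creates a variation of the classic locale
std::locale makeVariation()
{
    return std::locale(std::locale::classic(), new PlainPunct);
}
```

A variation created this way is neither a copy of the classic locale nor a locale with the same name, so it compares as different.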

The parentheses operator makes it possible to use a locale object as a comparator for strings. This operator uses the string comparison from the collate facet to compare the strings passed as the argument for ordering. Thus, it returns whether one string is less than the other string according to the locale object. This is the behavior of an STL function object (see Section 8.1), so you can use a locale object as a sorting criterion for STL algorithms that operate on strings. For example, a vector of strings can be sorted according to the rules for string collation of the German locale as follows:

    std::vector<std::string> v;
    ...
    // sort strings according to the German locale
    std::sort (v.begin(),v.end(),         //range
               std::locale("de_DE"));     //sorting criterion
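
Because the name de_DE is not guaranteed to be accepted everywhere, the following self-contained sketch passes the locale as a parameter. The function name sortedWith is hypothetical. With the classic locale the collate facet compares character values, so uppercase letters sort before lowercase letters in ASCII-based character sets.

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// Returns a copy of v sorted according to the collation
// rules of loc; the locale object itself is the sorting criterion.
std::vector<std::string> sortedWith (std::vector<std::string> v,
                                     const std::locale& loc)
{
    std::sort (v.begin(), v.end(),    //range
               loc);                  //sorting criterion
    return v;
}
```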

Facets in Detail

The important aspect of locales is the facets they contain. All locales are guaranteed to contain certain standard facets. The descriptions of the individual facets in the following subsections state which instantiations of the corresponding facet are guaranteed. In addition to these facets, an implementation of the C++ standard library may provide additional facets in the locales. What is important is that the user can also install her own facets or replace standard ones.

Section 14.2.2 discussed how to install a facet in a locale. For example, the class germanBoolNames was derived from the class numpunct_byname<char>, one of the standard facets, and installed in a locale using the constructor that takes a locale and a facet as arguments. But what do you need to create your own facet? Every class F that conforms to the following two requirements can be used as a facet:

  1. F derives publicly from class locale::facet. This base class mainly defines some mechanism for reference counting that is used internally by the locale objects. It also declares the copy constructor and the assignment operator to be private, thereby making it infeasible to copy or to assign facets.

  2. F has a publicly accessible static member named id of type locale::id. This member is used to look up a facet in a locale using the facet's type. The whole issue of using a type as the index is to have a type-safe interface. Internally, a normal container with an integer as the index is used to maintain the facets.
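
A minimal sketch of a completely new facet that meets both requirements might look as follows. The class PaperSize and the function paperSizeOf() are hypothetical names chosen for this illustration; they are not part of the C++ standard library.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical facet that stores the name of a preferred paper size.
class PaperSize : public std::locale::facet {    // requirement 1
  public:
    static std::locale::id id;                   // requirement 2

    explicit PaperSize (const std::string& n, std::size_t refs = 0)
      : std::locale::facet(refs), name_(n) {
    }
    // public const function that delegates to a protected virtual one
    std::string name() const {
        return do_name();
    }
  protected:
    virtual std::string do_name() const {
        return name_;
    }
  private:
    std::string name_;
};

// definition of the static member
std::locale::id PaperSize::id;

// Returns the paper size of loc or "unknown" if loc has no such facet.
std::string paperSizeOf (const std::locale& loc)
{
    if (!std::has_facet<PaperSize>(loc)) {
        return "unknown";
    }
    return std::use_facet<PaperSize>(loc).name();
}
```

The check with has_facet() matters here because, unlike the standard facets, a new facet is present only in locales in which it was installed explicitly.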

The standard facets conform not only to these requirements but also to some special implementation guidelines. Although conforming to these guidelines is not required, doing so is useful. The guidelines are as follows:

  1. All member functions are declared to be const. This is useful because use_facet() returns a reference to a const facet. Member functions that are not declared to be const can't be invoked.

  2. All public functions are nonvirtual and delegate each request to a protected virtual function. The protected function is named like the public one, with the addition of a leading do_. For example, numpunct::truename() calls numpunct::do_truename(). This style is used to avoid hiding member functions when overriding only one of several virtual member functions that have the same name. For example, the class num_put has several functions named put(). In addition, it gives the programmer of the base class the possibility of adding some extra code in the nonvirtual functions, which is executed even if the virtual function is overridden.

The following description of the standard facets concerns only the public functions. To modify a facet, you always have to override the corresponding protected functions. If you define functions with the same interface as the public facet functions, they only hide the public functions instead of overriding them because these functions are not virtual.

For most standard facets, a "_byname" version is defined. This version derives from the standard facet and is used to create an instantiation for a corresponding locale name. For example, the class numpunct_byname is used to create the numpunct facet for a named locale. Because facets are passed to locales as pointers, a German numpunct facet for type char can be created like this:

    new std::numpunct_byname<char>("de_DE")

The _byname classes are used internally by the locale constructors that take a name as an argument. For each of the standard facets supporting a name, the corresponding _byname class is used to construct an instance of the facet.

Numeric Formatting

Numeric formatting converts between the internal representation of numbers and the corresponding textual representations. The iostream operators delegate the actual conversion to the facets of the locale::numeric category. This category is formed by three facets:

  1. numpunct, which handles punctuation symbols used for numeric formatting and parsing

  2. num_put, which handles numeric formatting

  3. num_get, which handles numeric parsing

In short, the facet num_put does the numeric formatting described for iostreams in Section 13.7, and num_get parses the corresponding strings. Additional flexibility not directly accessible through the interface of the streams is provided by the numpunct facet.

Numeric Punctuation

The numpunct facet controls the symbol used as the decimal point, the insertion of optional thousands separators, and the strings used for the textual representation of Boolean values. Table 14.7 lists the members of numpunct.

Table 14.7. Members of the numpunct Facet

Expression            Meaning
np.decimal_point()    Returns the character used as the decimal point
np.thousands_sep()    Returns the character used as the thousands separator
np.grouping()         Returns a string describing the positions of the thousands separators
np.truename()         Returns the textual representation of true
np.falsename()        Returns the textual representation of false

numpunct takes a character type charT as the template argument. The characters returned from decimal_point() and thousands_sep() are of this type, and the functions truename() and falsename() return a basic_string<charT>. The two instantiations numpunct<char> and numpunct<wchar_t> are required.

Because long numbers are hard to read without intervening characters, the standard facets for numeric formatting and numeric parsing support thousands separators. Often, the digits representing an integer are grouped into triples. For example, one million is written like this:

    1,000,000

Unfortunately, it is not used everywhere exactly like that. For example, in German a period is used instead of a comma. Thus, a German would write one million like this:

    1.000.000

This difference is covered by the thousands_sep() member. But this is not sufficient because in some countries digits are not put into triples. For example, in Nepal people would write

    10,00,000

using even different numbers of digits in the groups. This is where the string returned from the function grouping() comes in. The number stored at index i gives the number of digits in the ith group, where counting starts with zero for the rightmost group. If there are fewer characters in the string than groups, the size of the last specified group is repeated. To create unlimited groups, you can use the value numeric_limits<char>::max() or, if there is no group at all, the empty string. Table 14.8 lists some examples of the formatting of one million.

Table 14.8. Examples of Numeric Punctuation of One Million

String                                      Result
{ 0 } or "" (the default for grouping())    1000000
{ 3, 0 } or "\3"                            1,000,000
{ 3, 2, 3, 0 } or "\3\2\3"                  10,00,000
{ 2, CHAR_MAX, 0 }                          10000,00

Note that ordinary digit characters are usually not useful in a grouping string. For example, the string "2" specifies groups of 50 digits for ASCII encoding because the character '2' has the integer value 50 in the ASCII character set.
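
The effect of the grouping string can be tried with a small facet. The following is a sketch under the assumption that a comma is wanted as the thousands separator; the class GroupedPunct and the function formatWithGrouping() are hypothetical names.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: comma as thousands separator and a
// grouping string that is passed to the constructor.
class GroupedPunct : public std::numpunct<char> {
  public:
    explicit GroupedPunct (const std::string& g) : grouping_(g) {
    }
  protected:
    virtual char do_thousands_sep() const {
        return ',';
    }
    virtual std::string do_grouping() const {
        return grouping_;
    }
  private:
    std::string grouping_;
};

// Formats value in a variation of the classic locale
// that uses the passed grouping string.
std::string formatWithGrouping (long value, const std::string& grouping)
{
    std::ostringstream os;
    os.imbue(std::locale(std::locale::classic(),
                         new GroupedPunct(grouping)));
    os << value;
    return os.str();
}
```

Note that the grouping strings are written with escape sequences, such as "\3", so that the characters have the values 3 and 2 rather than the values of the digit characters.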

Numeric Formatting

The num_put facet is used for textual formatting of numbers. It is a template class that takes two template arguments: the type charT of the characters to be produced and the type OutIt of an output iterator to the location at which the produced characters are written. The output iterator defaults to ostreambuf_iterator<charT>. The num_put facet provides a set of functions, all called put() and differing only in the last argument. You can use the facet as follows:

    std::locale      loc;
    OutIt            to = ...;
    std::ios_base&   fmt = ...;
    charT            fill = ...;
    T                value = ...;

    //get numeric output facet of the loc locale
    const std::num_put<charT,OutIt>& np
     = std::use_facet<std::num_put<charT,OutIt> >(loc);

    //write value with numeric output facet
    np.put(to, fmt, fill, value);

These statements would produce a textual representation of the value value using characters of type charT written to the output iterator to. The exact format is determined from the formatting flags stored in fmt, where the character fill is used as a fill character. The put() function returns an iterator pointing immediately after the last character written.

The facet num_put provides member functions that take objects of types bool, long, unsigned long, double, long double, and void* as the last argument. It does not provide member functions, for example, for short or int. This is no problem because corresponding values of built-in types are promoted to supported types if necessary.

The standard requires that the two instantiations num_put<char> and num_put<wchar_t> are stored in each locale (both using the default for the second template argument). In addition, the C++ standard library supports all instantiations that take a character type as the first template argument and an output iterator type as the second. Of course, it is not required that all of these instantiations are stored in each locale because this would be an infinite number of facets.
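
For a concrete use of num_put, a string stream can supply both the formatting flags and the output buffer. The following sketch uses the required instantiation num_put<char>; the function name putLong is hypothetical.

```cpp
#include <ios>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Writes value with the num_put facet of the stream's locale,
// using base (ios_base::dec, ios_base::hex, or ios_base::oct).
std::string putLong (long value, std::ios_base::fmtflags base)
{
    std::ostringstream os;    // provides format flags and buffer
    os.setf(base, std::ios_base::basefield);

    // get numeric output facet of the stream's locale
    const std::num_put<char>& np
     = std::use_facet<std::num_put<char> >(os.getloc());

    // write value directly into the stream buffer
    np.put(std::ostreambuf_iterator<char>(os), os, os.fill(), value);
    return os.str();
}
```

This is essentially what the output operator for long does, apart from the additional handling of the stream state.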

Numeric Parsing

The facet num_get is used to parse textual representations of numbers. Corresponding to the facet num_put, it is a template that takes two template arguments: the character type charT and an input iterator type InIt, which defaults to istreambuf_iterator<charT>. It provides a set of get() functions that differ only in the last argument. You can use the facet as follows:[4]

    std::locale                loc;      // locale
    InIt                       beg = ...;   // begin of input sequence
    InIt                       end = ...;   // end of input sequence
    std::ios_base&             fmt = ...;   // stream which defines input format
    std::ios_base::iostate     err;      // state after call
    T                          value;    // value after successful call

    //get numeric input facet of the loc locale
    const std::num_get<charT,InIt>& ng
     = std::use_facet<std::num_get<charT,InIt> >(loc);

    // read value with numeric input facet
    ng.get(beg, end, fmt, err, value);

These statements attempt to parse a numeric value corresponding to the type T from the sequence of characters between beg and end. The format of the expected numeric value is defined by the argument fmt. If the parsing fails, err is modified to contain the value ios_base::failbit. Otherwise, ios_base::goodbit is stored in err and the parsed value in value. The value of value is modified only if the parsing is successful. get() returns the second parameter (end) if the sequence was used completely. Otherwise, it returns an iterator pointing to the first character that could not be parsed as part of the numeric value.

The facet num_get supports functions to read objects of the types bool, long, unsigned short, unsigned int, unsigned long, float, double, long double, and void*. There are some types for which there is no corresponding function in the num_put facet; for example, unsigned short. This is because writing a value of type unsigned short produces the same result as writing a value of type unsigned short promoted to an unsigned long. However, reading a value as type unsigned long and then converting it to unsigned short may yield a different value than reading it as type unsigned short directly.

The standard requires that the two instantiations num_get<char> and num_get<wchar_t> be stored in each locale (both using the default for the second template argument). In addition, the C++ standard library supports all instantiations that take a character type as the first template argument and an input iterator type as the second. As with num_put, not all supported instantiations are required to be present in all locale objects.
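
A concrete use of num_get can be sketched with a string stream as the source of the characters. The function name parseLong is hypothetical. The sketch parses into a temporary object because the value stored after a failed conversion differs between versions of the standard (since C++11, zero is stored), and it tests only failbit because reaching the end of the input sets eofbit even after a successful conversion.

```cpp
#include <ios>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Parses a long from the beginning of text.
// Returns true and stores the result in value on success;
// otherwise returns false and leaves value unchanged.
bool parseLong (const std::string& text, long& value)
{
    std::istringstream is(text);
    std::istreambuf_iterator<char> beg(is);   // begin of input sequence
    std::istreambuf_iterator<char> end;       // end of input sequence
    std::ios_base::iostate err = std::ios_base::goodbit;
    long parsed = 0;

    // get numeric input facet of the stream's locale
    const std::num_get<char>& ng
     = std::use_facet<std::num_get<char> >(is.getloc());

    // read value with numeric input facet
    ng.get(beg, end, is, err, parsed);
    if (err & std::ios_base::failbit) {
        return false;
    }
    value = parsed;
    return true;
}
```

Unlike the input operators of streams, the facet does not skip leading whitespace.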

Time and Date Formatting

The two facets time_get and time_put in the category time provide services for parsing and formatting times and dates. This is done by the member functions that operate on objects of type tm. This type is defined in the header file <ctime>. The objects are not passed directly; rather, a pointer to them is used as the argument.

Both facets in the time category depend heavily on the behavior of the function strftime() (also defined in the header file <ctime>). This function uses a string with conversion specifiers to produce a string from a tm object. Table 14.9 provides a brief summary of the conversion specifiers. The same conversion specifiers are also used by the time_put facet.

Of course, the exact string produced by strftime() depends on the C locale in effect. The examples in the table are given for the "C" locale.

Time and Date Parsing

The facet time_get is a template that takes a character type charT and an input iterator type InIt as template arguments. The input iterator type defaults to istreambuf_iterator<charT>. Table 14.10 lists the members defined for the time_get facet. All of these members, except date_order(), parse the string and store the results in the tm object pointed to by the argument t. If the string could not be parsed correctly, either an error is reported (for example, by modifying the argument err) or an unspecified value is stored. This means that a time produced by a program can be parsed reliably but user input cannot. With the argument fmt, other facets used during parsing are determined. Whether other flags from fmt have any influence on the parsing is not specified.

All functions return an iterator that has the position immediately after the last character read. The parsing stops if parsing is complete or if an error occurs (for example, because a string could not be parsed as a date).

A function reading the name of a weekday or a month reads both abbreviated names and full names. If the abbreviation is followed by a letter, which would be legal for a full name, the function attempts to read the full name. If this fails, the parsing fails, even though an abbreviated name was already parsed successfully.

Table 14.9. Conversion Specifiers for strftime()

Specifier    Meaning                                             Example
%a           Abbreviated weekday                                 Sun
%A           Full weekday                                        Sunday
%b           Abbreviated month name                              Jul
%B           Full month name                                     July
%c           Locale's preferred date and time representation     Jul 12 21:53:22 1998
%d           Day of the month                                    12
%H           Hour of the day using a 24-hour clock               21
%I           Hour of the day using a 12-hour clock               09
%j           Day of the year                                     193
%m           Month as decimal number                             07
%M           Minutes                                             53
%p           Morning or evening (am or pm)                       pm
%S           Seconds                                             22
%U           Week number starting with the first Sunday          28
%W           Week number starting with the first Monday          27
%w           Weekday as a number (Sunday == 0)                   0
%x           Locale's preferred date representation              Jul 12 1998
%X           Locale's preferred time representation              21:53:22
%y           The year without the century                        98
%Y           The year with the century                           1998
%Z           The time zone                                       MEST
%%           The literal %                                       %

Whether a function that is parsing a year allows two-digit years is unspecified. The year that is assumed for a two-digit year, if it is allowed, is also unspecified.

date_order() returns the order in which the day, month, and year appear in a date string. This is necessary for some dates because the order cannot be determined from the string representing a date. For example, the first day in February in the year 2003 may be printed either as 3/2/1 or as 1/2/3. Class time_base, which is the base class of the facet time_get, defines an enumeration called dateorder for possible date order values. Table 14.11 lists these values.

The standard requires that the two instantiations time_get<char> and time_get<wchar_t> are stored in each locale. In addition, the C++ standard library supports all instantiations that take char or wchar_t as the first template argument, and a corresponding input iterator as the second. Not all of these instantiations are required to be stored in each locale object.

Table 14.10. Members of the time_get Facet

Expression                               Meaning
tg.get_time(beg,end,fmt,err,t)           Parses the string between beg and end as the time produced by the X specifier for strftime()
tg.get_date(beg,end,fmt,err,t)           Parses the string between beg and end as the date produced by the x specifier for strftime()
tg.get_weekday(beg,end,fmt,err,t)        Parses the string between beg and end as the name of the weekday
tg.get_monthname(beg,end,fmt,err,t)      Parses the string between beg and end as the name of the month
tg.get_year(beg,end,fmt,err,t)           Parses the string between beg and end as the year
tg.date_order()                          Returns the date order used by the facet

Table 14.11. Members of the Enumeration dateorder

Value       Meaning
no_order    No particular order (for example, a date may be in Julian format)
dmy         The order is day, month, year
mdy         The order is month, day, year
ymd         The order is year, month, day
ydm         The order is year, day, month
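
As a sketch of how the members of time_get are called, the following hypothetical function parseTime() reads a time with get_time(). It imbues the classic locale explicitly and assumes that this locale uses the form hours:minutes:seconds for the X specifier, as in the example of Table 14.9.

```cpp
#include <ctime>
#include <ios>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Parses text as a time in the format of the classic locale.
// Returns true if no error was reported; the result is stored in t.
bool parseTime (const std::string& text, std::tm& t)
{
    std::istringstream is(text);
    is.imbue(std::locale::classic());
    std::istreambuf_iterator<char> beg(is);   // begin of input sequence
    std::istreambuf_iterator<char> end;       // end of input sequence
    std::ios_base::iostate err = std::ios_base::goodbit;

    // get time input facet of the stream's locale
    const std::time_get<char>& tg
     = std::use_facet<std::time_get<char> >(is.getloc());

    // the tm object is passed as a pointer
    tg.get_time(beg, end, is, err, &t);
    return !(err & std::ios_base::failbit);
}
```

Only failbit is tested because eofbit may be set when the whole input was consumed.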

Time and Date Formatting

The facet time_put is used for formatting times and dates. It is a template that takes as arguments a character type charT and an optional output iterator type OutIt. The latter defaults to type ostreambuf_iterator<charT> (see page 665).

The facet time_put defines two functions called put(), which are used to convert the date information stored in an object of type tm into a sequence of characters written to an output iterator. Table 14.12 lists the members of the facet time_put.

Table 14.12. Members of the time_put Facet

Expression                           Meaning
tp.put(to,fmt,fill,t,cbeg,cend)      Converts according to the string [cbeg,cend)
tp.put(to,fmt,fill,t,cvt,mod)        Converts using the conversion specifier cvt

Both functions write their results to the output iterator to and return an iterator pointing immediately after the last character produced. The argument fmt is of type ios_base and is used to access other facets and potentially additional formatting information. The character fill is used when a space character is needed and for filling. The argument t points to an object of type tm that is storing the date to be formatted.

The version of put() that takes two characters as the last two arguments formats the date found in the tm object to which t refers, interpreting the argument cvt like a conversion specifier to strftime(). This put() function does only one conversion; namely, the one specified by the cvt character. This function is called by the other put() function for each conversion specifier found. For example, using 'X' as the conversion specifier results in the time that is stored in *t being written to the output iterator. The meaning of the argument mod is not defined by the standard. It is intended to be used as a modifier to the conversion as found in several implementations of the strftime() function.

The version of put() that takes a string defined by the range [cbeg,cend) to guide the conversion behaves very much like strftime(). It scans the string and writes every character that is not part of a conversion specification to the output iterator to. If it encounters a conversion specification introduced by the character %, it extracts an optional modifier and a conversion specifier. The function continues by calling the other version of put(), using the conversion specifier and the modifier as the last two arguments. After processing a conversion specification, put() continues to scan the string.

Note that this facet is somewhat unusual because it provides a nonvirtual member function; namely, the function put(), which uses a string as the conversion specification. This function cannot be overridden in classes derived from time_put. Only the other put() function can be overridden.

The standard requires that the two instantiations time_put<char> and time_put<wchar_t> are stored in each locale. In addition, the C++ standard library supports all instantiations that take char or wchar_t as the first template argument and a corresponding output iterator as the second. There is no guaranteed support for instantiations using a type other than char or wchar_t as the first template argument. Also, it is not guaranteed that any instantiations other than time_put<char> and time_put<wchar_t> be stored in locale objects by default.
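
The version of put() that takes a format string can be sketched as follows. The function name formatTime is hypothetical; the classic locale is imbued explicitly so that the result does not depend on the global locale.

```cpp
#include <ctime>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Formats t according to pattern, which may contain
// conversion specifiers as used by strftime().
std::string formatTime (const std::tm& t, const std::string& pattern)
{
    std::ostringstream os;
    os.imbue(std::locale::classic());

    // get time output facet of the stream's locale
    const std::time_put<char>& tp
     = std::use_facet<std::time_put<char> >(os.getloc());

    // convert according to the string [cbeg,cend)
    const char* cbeg = pattern.data();
    const char* cend = cbeg + pattern.size();
    tp.put(std::ostreambuf_iterator<char>(os), os, os.fill(),
           &t, cbeg, cend);
    return os.str();
}
```

Characters of the pattern that are not part of a conversion specification are copied to the output unchanged.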

Monetary Formatting

The category monetary consists of the facets moneypunct, money_get, and money_put. The facet moneypunct defines the format of monetary values. The other two use this information to format or to parse a monetary value.

Monetary Punctuation

Monetary values are printed differently depending on the context. The formats used in different cultural communities differ widely. Examples of the varying details are the placement of the currency symbol (if present at all), the notation for negative or positive values, the use of national or international currency symbols, and the use of thousands separators. To provide the necessary flexibility, the details of the format are factored into the facet moneypunct.

The facet moneypunct is a template that takes as arguments a character type charT and a Boolean value that defaults to false. The Boolean value indicates whether local (false) or international (true) currency symbols are to be used. Table 14.13 lists the members of the facet moneypunct.

Table 14.13. Members of the moneypunct Facet

Expression            Meaning
mp.decimal_point()    Returns a character to be used as the decimal point
mp.thousands_sep()    Returns a character to be used as the thousands separator
mp.grouping()         Returns a string specifying the placement of the thousands separators
mp.curr_symbol()      Returns a string with the currency symbol
mp.positive_sign()    Returns a string with the positive sign
mp.negative_sign()    Returns a string with the negative sign
mp.frac_digits()      Returns the number of fractional digits
mp.pos_format()       Returns the format to be used for non-negative values
mp.neg_format()       Returns the format to be used for negative values

moneypunct derives from the class money_base. This base class defines an enumeration called part, which is used to form a pattern for monetary values. The class also defines a type called pattern (which is actually a structure that holds an array of four elements of type char, named field). This type is used to store four values of type part that form a pattern describing the layout of a monetary value. Table 14.14 lists the five possible parts that can be placed in a pattern.

Table 14.14. Parts of Monetary Layout Patterns

Value     Meaning
none      At this position, spaces may appear but are not required
space     At this position, at least one space is required
sign      At this position, a sign may appear
symbol    At this position, the currency symbol may appear
value     At this position, the value appears

moneypunct defines two functions that return patterns: the function neg_format() for negative values and the function pos_format() for non-negative values. In a pattern, each of the parts sign, symbol, and value is mandatory, and one of the parts none and space has to appear. This does not mean, however, that there is really a sign or a currency symbol printed. What is printed at the positions indicated by the parts depends on the values returned from other members of the facet and on the formatting flags passed to the functions for formatting.

Only the value always appears. Of course, it is placed at the position where the part value appears in the pattern. The value has exactly frac_digits() fractional digits, with decimal_point() used as the decimal point (unless there are no fractional digits, in which case no decimal point is used).

When monetary values are printed, thousands separators are always inserted according to the string returned from grouping(). When monetary values are read, thousands separators are allowed but not required in the input, unless the grouping string is empty, in which case no thousands separators are allowed. The character used for the thousands separator is the one returned from thousands_sep(). The rules for the placement of the thousands separators are identical to the rules for numeric formatting (see page 705). When separators are present in the input, their placement is checked according to grouping() after all other parsing is successful.

The parts space and none control the placement of spaces. space is used at a position where at least one space is required. During formatting, if ios_base::internal is specified in the format flags, fill characters are inserted at the position of the space or the none part. Of course, filling is done only if the other characters do not already use up the specified minimum width. The character used as the space character is passed as the argument to the functions for the formatting of monetary values. If the formatted value does not contain a space, none can be placed at the last position. space and none may not appear as the first part in a pattern, and space may not be the last part in a pattern.

Signs for monetary values may consist of more than one character. For example, in certain contexts parentheses around a value are used to indicate negative values. At the position where the sign part appears in the pattern, the first character of the sign appears. All other characters of the sign appear at the end after all other components. If the string for a sign is empty, no character indicating the sign appears. The character that is to be used as a sign is determined with the function positive_sign() for non-negative values and negative_sign() for negative values.

At the position of the symbol part, the currency symbol appears. The symbol is present only if the formatting flags used during formatting or parsing have the ios_base::showbase flag set. The string returned from the function curr_symbol() is used as the currency symbol. The currency symbol is a local symbol to be used to indicate the currency if the second template argument is false (the default). Otherwise, an international currency symbol is used.

Table 14.15 illustrates all of this, using the value $-1234.56 as an example. Of course, this means that frac_digits() returns 2. In addition, a width of 0 is always used.

Table 14.15. Examples of Using the Monetary Pattern

Pattern                    Sign   Result
symbol none sign value            $1234.56
symbol none sign value     -      $-1234.56
symbol space sign value    -      $ -1234.56
symbol space sign value    ()     $ (1234.56)
sign symbol space value    ()     ($ 1234.56)
sign value space symbol    ()     (1234.56 $)
symbol space value sign    -      $ 1234.56-
sign value space symbol    -      -1234.56 $
sign value none symbol     -      -1234.56$

The standard requires that the instantiations moneypunct<char>, moneypunct<wchar_t>, moneypunct<char, true>, and moneypunct<wchar_t, true> are stored in each locale. The C++ standard library does not support any other instantiation.

Monetary Formatting

The facet money_put is used to format monetary values. It is a template that takes a character type charT as the first template argument and an output iterator OutIt as the second. The output iterator defaults to ostreambuf_iterator<charT>. The two member functions put() produce a sequence of characters corresponding to the format specified by a moneypunct facet. The value to be formatted is either passed as type long double or as type basic_string<charT>. You can use the facet as follows:

    // get monetary output facet of the loc locale
    const std::money_put<charT,OutIt>& mp
        = std::use_facet<std::money_put<charT,OutIt> >(loc);

    // write value with monetary output facet
    mp.put(to, intl, fmt, fill, value);

The argument to is an output iterator of type OutIt to which the formatted string is written. put() returns an object of this type pointing immediately after the last character produced. The argument intl indicates whether a local or an international currency symbol is to be used. fmt is used to determine formatting flags, such as the width to be used and the moneypunct facet defining the format of the value to be printed. Where a space character has to appear, the character fill is inserted.

The argument value has type long double or type basic_string<charT>. This is the value that is formatted. If the argument is a string, this string may consist only of decimal digits with an optional leading minus sign. If the first character of the string is a minus sign, the value is formatted as a negative value. After it is determined that the value is negative, the minus sign is discarded. The number of fractional digits in the string is determined from the member function frac_digits() of the moneypunct facet.
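The interplay of pattern, sign, symbol, and put() can be tried out with a small sketch. The facet DemoMoneyPunct and the function formatMoney() are hypothetical names introduced only for this illustration; the facet hard-codes the layout "symbol space sign value" from Table 14.15, and the amount is assumed to be given in cents.

```cpp
#include <ios>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for illustration: layout "symbol space sign value".
class DemoMoneyPunct : public std::moneypunct<char> {
  protected:
    char        do_decimal_point() const { return '.'; }
    std::string do_grouping()      const { return ""; }   // no thousands separators
    std::string do_curr_symbol()   const { return "$"; }
    std::string do_positive_sign() const { return ""; }
    std::string do_negative_sign() const { return "-"; }
    int         do_frac_digits()   const { return 2; }
    pattern     do_pos_format()    const { return layout(); }
    pattern     do_neg_format()    const { return layout(); }
  private:
    static pattern layout() {
        pattern p;
        p.field[0] = symbol;
        p.field[1] = space;
        p.field[2] = sign;
        p.field[3] = value;
        return p;
    }
};

// Formats an amount given in the smallest currency unit (assumed: cents).
std::string formatMoney(long double cents)
{
    std::ostringstream os;
    // the locale takes ownership of the facet
    os.imbue(std::locale(std::locale::classic(), new DemoMoneyPunct));
    os.setf(std::ios_base::showbase);       // request the currency symbol

    const std::money_put<char>& mp
        = std::use_facet<std::money_put<char> >(os.getloc());
    // to = iterator into os, intl = false, fmt = os, fill = ' '
    mp.put(std::ostreambuf_iterator<char>(os), false, os, ' ', cents);
    return os.str();
}
```

With this facet, formatMoney(-123456) should correspond to the third row of Table 14.15. Passing the stream itself as the fmt argument is a common choice because it supplies both the format flags and the imbued locale.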

The standard requires that the two instantiations money_put<char> and money_put<wchar_t> are stored in each locale. In addition, the C++ standard library supports all instantiations that take char or wchar_t as the first template argument and a corresponding output iterator as the second. These additional instantiations are not required to be stored in each locale object.

Monetary Parsing

The facet money_get is used for parsing of monetary values. It is a template class that takes a character type charT as the first template argument and an input iterator type InIt as the second. The second template argument defaults to istreambuf_iterator<charT>. This class defines two member functions called get() that try to parse a character sequence and, if the parse is successful, store the result in a value of type long double or of type basic_string<charT>. You can use the facet as follows:

    // get monetary input facet of the loc locale
    const std::money_get<charT,InIt>& mg
        = std::use_facet<std::money_get<charT,InIt> >(loc);

    // read value with monetary input facet
    mg.get(beg, end, intl, fmt, err, val);

The character sequence to be parsed is defined by the sequence between beg and end. The parsing stops as soon as either all elements of the used pattern are read or an error is encountered. If an error is encountered, the ios_base::failbit is set in err and nothing is stored in val. If parsing is successful, the result is stored in the value of type long double or basic_string that is passed by reference as argument val.

The argument intl is a Boolean value that selects a local or an international currency string. The moneypunct facet defining the format of the value to be parsed is retrieved using the locale object imbued in the argument fmt. For parsing a monetary value, the pattern returned from the member neg_format() of the moneypunct facet is always used.

At the position of none or space, the function that is parsing a monetary value consumes all available space, unless none is the last part in a pattern. Trailing spaces are not skipped. The get() functions return an iterator that points after the last character that was consumed.
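The parsing direction can be sketched in the same way. The facet DemoInputPunct and the function parseMoney() are hypothetical names used only for illustration; the facet is assumed to use the layout "symbol sign none value" with two fractional digits, and the showbase flag is set so that the currency symbol is expected in the input.

```cpp
#include <ios>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for illustration: layout "symbol sign none value".
class DemoInputPunct : public std::moneypunct<char> {
  protected:
    char        do_decimal_point() const { return '.'; }
    std::string do_grouping()      const { return ""; }
    std::string do_curr_symbol()   const { return "$"; }
    std::string do_positive_sign() const { return ""; }
    std::string do_negative_sign() const { return "-"; }
    int         do_frac_digits()   const { return 2; }
    pattern     do_neg_format()    const {
        pattern p;
        p.field[0] = symbol;
        p.field[1] = sign;
        p.field[2] = none;
        p.field[3] = value;
        return p;
    }
    pattern     do_pos_format()    const { return do_neg_format(); }
};

// Parses text such as "$-12.34"; on success stores the amount in cents.
bool parseMoney(const std::string& text, long double& cents)
{
    std::istringstream is(text);
    is.imbue(std::locale(std::locale::classic(), new DemoInputPunct));
    is.setf(std::ios_base::showbase);       // the currency symbol is expected

    const std::money_get<char>& mg
        = std::use_facet<std::money_get<char> >(is.getloc());
    std::ios_base::iostate err = std::ios_base::goodbit;
    long double val = 0;
    mg.get(std::istreambuf_iterator<char>(is),
           std::istreambuf_iterator<char>(),
           false, is, err, val);
    if (err & std::ios_base::failbit) {
        return false;                       // nothing usable was parsed
    }
    cents = val;
    return true;
}
```

Note that the result is expressed in units of the smallest currency fraction, so the input "$-12.34" is expected to yield -1234 rather than -12.34.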

The standard requires that the two instantiations money_get<char> and money_get<wchar_t> be stored in each locale. In addition, the C++ standard library supports all instantiations that take char or wchar_t as the first template argument and a corresponding input iterator as the second. These additional instantiations are not required to be stored in each locale object.

Character Classification and Conversion

The C++ standard library defines two facets to deal with characters: ctype and codecvt. Both belong to the category locale::ctype. The facet ctype is used mainly for character classification, such as testing whether a character is a letter. It also provides methods for conversion between lowercase and uppercase letters and for conversion between char and the character type for which the facet is instantiated. The facet codecvt is used to convert characters between different encodings and is used mainly by basic_filebuf to convert between external and internal representations.

Character Classification

The facet ctype is a template class parameterized with a character type. Three kinds of functions are provided by the class ctype<charT>:

  1. Functions to convert between char and charT

  2. Functions for character classification

  3. Functions for conversion between uppercase and lowercase letters

Table 14.16 lists the members defined for the facet ctype.

Table 14.16. Services Defined by the ctype<charT> Facet

Expression Effect
ct.is(m,c) Tests whether the character c matches the mask m
ct.is(beg,end,vec) For each character in the range between beg and end, places a mask matched by the character in the corresponding location of vec
ct.scan_is(m,beg,end) Returns a pointer to the first character in the range between beg and end that matches the mask m or end if there is no such character
ct.scan_not(m,beg,end) Returns a pointer to the first character in the range between beg and end that does not match the mask m or end if all characters match the mask
ct.toupper(c) Returns an uppercase letter corresponding to c if there is such a letter; otherwise c is returned
ct.toupper(beg,end) Converts each letter in the range between beg and end by replacing the letter with the result of toupper()
ct.tolower(c) Returns a lowercase letter corresponding to c if there is such a letter; otherwise c is returned
ct.tolower(beg,end) Converts each letter in the range between beg and end by replacing the letter with the result of tolower()
ct.widen(c) Returns the char converted to charT
ct.widen(beg,end,dest) For each character in the range between beg and end, places the result of widen() at the corresponding location in dest
ct.narrow(c,default) Returns the charT c converted to char, or the char default if there is no suitable character
ct.narrow(beg,end,default,dest) For each character in the range between beg and end, places the result of narrow() at the corresponding location in dest

The function is(beg,end,vec) is used to store a set of masks in an array. For each of the characters in the range between beg and end, a mask with the attributes corresponding to the character is stored in the array pointed to by vec. This is useful to avoid virtual function calls for the classification of characters if there are lots of characters to be classified.
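As a minimal sketch of the range-based members, the following functions classify a whole wide-character string with one call to is() and locate the first digit with scan_is(). The function names countDigits() and firstDigit() are made up for this illustration.

```cpp
#include <cstddef>
#include <locale>
#include <string>
#include <vector>

// Counts the characters of text that loc classifies as digits,
// using a single bulk classification call.
std::size_t countDigits(const std::wstring& text, const std::locale& loc)
{
    if (text.empty()) {
        return 0;
    }
    const std::ctype<wchar_t>& ct
        = std::use_facet<std::ctype<wchar_t> >(loc);

    std::vector<std::ctype_base::mask> masks(text.size());
    ct.is(text.data(), text.data() + text.size(), &masks[0]);

    std::size_t n = 0;
    for (std::size_t i = 0; i < masks.size(); ++i) {
        if ((masks[i] & std::ctype_base::digit) != 0) {
            ++n;
        }
    }
    return n;
}

// Returns the index of the first digit in text, or text.size() if none.
std::size_t firstDigit(const std::wstring& text, const std::locale& loc)
{
    const std::ctype<wchar_t>& ct
        = std::use_facet<std::ctype<wchar_t> >(loc);
    const wchar_t* beg = text.data();
    const wchar_t* end = beg + text.size();
    return static_cast<std::size_t>(
               ct.scan_is(std::ctype_base::digit, beg, end) - beg);
}
```

For example, countDigits(L"ab12c3", std::locale::classic()) inspects six characters with a single call into the facet.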

The function widen() can be used to convert a character of type char from the native character set to the corresponding character in the character set used by a locale. Thus, it makes sense to widen a character even if the result is also of type char. For the opposite direction, the function narrow() can be used to convert a character from the character set used by the locale to a corresponding char in the native character set, provided there is such a char. For example, the following code converts the decimal digits from char to wchar_t:

    std::locale loc;
    char narrow[] = "0123456789";
    wchar_t wide [10];


    std::use_facet<std::ctype<wchar_t> >(loc).widen(narrow, narrow+10,
                                                    wide);
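The single-character forms can be sketched as follows. The helper names toWide() and toNarrow() are hypothetical, and the choice of '?' as the default character is only an assumption for this example.

```cpp
#include <locale>

// Converts a char to the wide character used by loc.
wchar_t toWide(char c, const std::locale& loc)
{
    return std::use_facet<std::ctype<wchar_t> >(loc).widen(c);
}

// Converts a wide character to char; yields '?' if loc has no
// corresponding char for it.
char toNarrow(wchar_t wc, const std::locale& loc)
{
    return std::use_facet<std::ctype<wchar_t> >(loc).narrow(wc, '?');
}
```

In the classic locale, a digit such as '7' should survive the round trip through toWide() and toNarrow(), whereas a wide character without a counterpart in the narrow character set, such as the euro sign L'\u20AC', is expected to be replaced by the default character.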

Class ctype derives from the class ctype_base. This class is used only to define an enumeration called mask. This enumeration defines values that can be combined to form a bitmask used for testing character properties. The values defined in ctype_base are shown in Table 14.17. The functions for character classification all take a bitmask as an argument, which is formed by combinations of the values defined in ctype_base. To create bitmasks as needed, you can use the operators for bit manipulation (|, &, ^, and ~). A character matches this mask if it is any of the characters identified by the mask.

Table 14.17. Character Mask Values Used by ctype

Value Meaning
ctype_base::alnum Tests for letters and digits (equivalent to alpha | digit)
ctype_base::alpha Tests for letters
ctype_base::cntrl Tests for control characters
ctype_base::digit Tests for decimal digits
ctype_base::graph Tests for punctuation characters, letters, and digits (equivalent to alnum | punct)
ctype_base::lower Tests for lowercase letters
ctype_base::print Tests for printable characters
ctype_base::punct Tests for punctuation characters
ctype_base::space Tests for space characters
ctype_base::upper Tests for uppercase letters
ctype_base::xdigit Tests for hexadecimal digits

Specialization of ctype<> for Type char

For better performance of the character classification functions, the facet ctype is specialized for the character type char. This specialization does not delegate the functions dealing with character classification (is(), scan_is(), and scan_not()) to corresponding virtual functions. Instead, these functions are implemented inline using a table lookup. For this case additional members are provided (Table 14.18).

Table 14.18. Additional Members of ctype<char>

Expression Effect
ctype<char>::table_size Returns the size of the table (>=256)
ctype<char>::classic_table() Returns the table for the "classic" C locale
ctype<char>(table,del=false) Creates the facet with table table
ct.table() Returns the current table of facet ct

Manipulating the behavior of these functions for specific locales is done with a corresponding table of masks that is passed as a constructor argument:

    // create and initialize the table
    std::ctype_base::mask mytable[std::ctype<char>::table_size] = {
        ...
    };

    // use the table for the ctype<char> facet ct
    std::ctype<char> ct(mytable, false);

This code constructs a ctype<char> facet that uses the table mytable to determine the character class of a character. More precisely, the character class of the character c is determined by

    mytable[static_cast<unsigned char>(c)]

The static member table_size is a constant defined by the library implementation and gives the size of the lookup table. This size is at least 256 entries. The second optional argument to the constructor of ctype<char> indicates whether the table should be deleted if the facet is destroyed. If it is true, the table passed to the constructor is released by using delete [] when the facet is no longer needed.

The member function table() is a protected member function that returns the table that is passed as the first argument to the constructor. The static protected member function classic_table() returns the table that is used for character classification in the classic C locale.
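A typical use of a custom table is to change what a stream considers whitespace. The following sketch derives a facet from ctype<char> that copies the classic table and additionally marks the comma as a space character. The names CommaIsSpace and sumFields() are hypothetical and serve only as an illustration.

```cpp
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet: like the classic table, but ',' also counts as a space.
class CommaIsSpace : public std::ctype<char> {
  public:
    CommaIsSpace() : std::ctype<char>(makeTable()) {   // del defaults to false
    }
  private:
    static const mask* makeTable() {
        // copy the classic table once, then adjust a single entry
        static std::vector<mask> table(classic_table(),
                                       classic_table() + table_size);
        table[static_cast<unsigned char>(',')] |= space;
        return &table[0];
    }
};

// Sums integers that are separated by commas or ordinary whitespace.
int sumFields(const std::string& text)
{
    std::istringstream in(text);
    // the locale takes ownership of the facet
    in.imbue(std::locale(std::locale::classic(), new CommaIsSpace));

    int sum = 0;
    int n;
    while (in >> n) {       // operator >> skips "spaces", now including ','
        sum += n;
    }
    return sum;
}
```

Because the table lives in a static object, the facet can be constructed with del set to false; the table outlives every facet that refers to it.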

Global Convenience Functions for Character Classification

Convenient use of the ctype facets is provided by predefined global functions. Table 14.19 lists all of the global functions.

Table 14.19. Global Convenience Functions for Character Classification

Function Effect
isalnum(c, loc) Returns whether c is a letter or a digit (equivalent to isalpha()||isdigit())
isalpha(c, loc) Returns whether c is a letter
iscntrl(c, loc) Returns whether c is a control character
isdigit(c, loc) Returns whether c is a digit
isgraph(c, loc) Returns whether c is a printable, nonspace character (equivalent to isalnum()||ispunct())
islower(c, loc) Returns whether c is a lowercase letter
isprint(c, loc) Returns whether c is a printable character (including whitespaces)
ispunct(c, loc) Returns whether c is a punctuation character (that is, it is printable, but it is not a space, digit, or letter)
isspace(c, loc) Returns whether c is a space character
isupper(c, loc) Returns whether c is an uppercase letter
isxdigit(c, loc) Returns whether c is a hexadecimal digit
tolower(c, loc) Converts c from an uppercase letter to a lowercase letter
toupper(c, loc) Converts c from a lowercase letter to an uppercase letter

For example, the following expression determines whether the character c is a lowercase letter in the locale loc:

    std::islower(c,loc)

It returns a corresponding value of type bool.

The following expression returns the character c converted to an uppercase letter, if c is a lowercase letter in the locale loc:

    std::toupper(c,loc)

If c is not a lowercase letter, the first argument is returned unmodified.

The expression

    std::islower(c,loc)

is equivalent to the following expression:

    std::use_facet<std::ctype<char> >(loc).is(std::ctype_base::lower,c)

This expression calls the member function is() of the facet ctype<char>. is() determines whether the character c fulfills any of the character properties that are passed as the bitmask in the first argument. The values for the bitmask are defined in the class ctype_base. See page 502 and page 669 for examples of the use of these convenience functions.
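As a brief sketch of the convenience functions in use, the following function converts all lowercase letters of a string to uppercase according to a given locale. The function name upperCopy() is made up for this example.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Returns a copy of s in which every lowercase letter (according to loc)
// is replaced by its uppercase counterpart.
std::string upperCopy(std::string s, const std::locale& loc)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (std::islower(s[i], loc)) {
            s[i] = std::toupper(s[i], loc);
        }
    }
    return s;
}
```

The explicit test with islower() is not strictly necessary because toupper() returns other characters unmodified; it is included only to show both kinds of functions.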

The global convenience functions for character classification correspond to C functions that have the same name but only the first argument. They are defined in <cctype> and <ctype.h>, and always use the current global C locale.[4] Their use is even more convenient:

    if (std::isdigit(c)) {
        ...
    }

However, by using them you can't use different locales in the same program. Also, you can't use a user-defined ctype facet using the C function. See page 497 for an example that demonstrates how to use these C functions to convert all characters of a string to uppercase letters.

It is important to note that the C++ convenience functions should not be used in code sections where performance is crucial. It is much faster to obtain the corresponding facet from the locale and use the functions on this object directly. If a lot of characters are to be classified according to the same locale, this can be improved even more, at least for non-char characters. The function is(beg,end,vec) can be used to determine the masks for typical characters: For each character in the range [beg,end), this function determines a mask that describes the properties of the character. The resulting mask is stored in vec at the position corresponding to the character's position. This vector can then be used for fast lookup of the characters.
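Obtaining the facet once and letting it process a whole range might look like the following sketch. The function name lowerCopy() is hypothetical.

```cpp
#include <locale>
#include <string>

// Returns a lowercase copy of s; the facet is looked up only once and
// converts the whole character range with a single call.
std::string lowerCopy(std::string s, const std::locale& loc)
{
    const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
    if (!s.empty()) {
        ct.tolower(&s[0], &s[0] + s.size());
    }
    return s;
}
```

Compared with calling std::tolower(c,loc) for each character, this avoids repeating the facet lookup inside the loop.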

Character Encoding Conversion

The facet codecvt is used to convert between internal and external character encoding. For example, it can be used to convert between Unicode and EUC (Extended UNIX Code), provided the implementation of the C++ standard library supports a corresponding facet.

This facet is used by the class basic_filebuf to convert between the internal representation and the representation stored in a file. The class basic_filebuf<charT,traits> (see page 627) uses the instantiation codecvt<charT,char,typename traits::state_type> to do so. The facet used is taken from the locale stored with basic_filebuf. This is the major application of the codecvt facet. Only rarely is it necessary to use this facet directly.

In Section 14.1, some basics of character encodings are introduced. To understand codecvt, you need to know that there are two approaches for the encoding of characters: One is character encodings that use a fixed number of bytes for each character (wide-character representation), and the other is character encodings that use a varying number of bytes per character (multibyte representation).

It is also necessary to know that multibyte representations use so-called shift states for space efficient representation of characters. The correct interpretation of a byte is possible only with the correct shift state at this position. This in turn can be determined only by walking through the whole sequence of multibyte characters (see Section 14.1, for more details).

The codecvt<> facet takes three template arguments:

  1. The character type internT used for an internal representation

  2. The type externT used to represent an external representation

  3. The type stateT used to represent an intermediate state during the conversion

The intermediate state may consist of incomplete wide characters or the current shift state. The C++ standard makes no restriction about what is stored in the objects representing the state.

The internal representation always uses a representation with a fixed number of bytes per character. Mainly the two types char and wchar_t are intended to be used within a program. The external representation may be a representation that uses a fixed size or a multibyte representation. When a multibyte representation is used, the second template argument is the type used to represent the basic units of the multibyte encoding. Each multibyte character is stored in one or more objects of this type. Normally, the type char is used for this.

The third argument is the type used to represent the current state of the conversion. It is necessary, for example, if one of the character encodings is a multibyte encoding. In this case, the processing of a multibyte character might be terminated because the source buffer is drained or the destination buffer is full while one character is being processed. If this happens, the current state of the conversion is stored in an object of this type.

Similar to the other facets, the standard requires support for only very few conversions. Only the following two instantiations are supported by the C++ standard library:

  1. codecvt<char,char,mbstate_t>, which converts the native character set to itself (this is actually a degenerate version of the codecvt facet)

  2. codecvt<wchar_t,char,mbstate_t>, which converts between the native tiny character set (that is, char) and the native wide-character set (that is, wchar_t)

The C++ standard does not specify the exact semantics of the second conversion. The only natural thing to do, however, is to split each wchar_t into sizeof(wchar_t) objects of type char for the conversion from wchar_t to char, and to assemble a wchar_t from the same number of chars when converting in the opposite direction. Note that this conversion is very different from the conversion between char and wchar_t done by the widen() and narrow() member functions of the ctype facet: While the codecvt functions use the bits of multiple chars to form one wchar_t (or vice versa), the ctype functions convert a character in one encoding to the corresponding character in another encoding (if there is such a character).

Like the ctype facet, codecvt derives from a base class used to define an enumeration type. This class is named codecvt_base, and it defines an enumeration called result. The values of this enumeration are used to indicate the results of codecvt's members. The exact meanings of the values depend on the member function used. Table 14.20 lists the member functions of the codecvt facet.

Table 14.20. Members of the codecvt Facet

Expression Meaning
cvt.in(s,fb,fe,fn,tb,te,tn) Converts external representation to internal representation
cvt.out(s,fb,fe,fn,tb,te,tn) Converts internal representation to external representation
cvt.unshift(s,tb,te,tn) Writes escape sequence to switch to initial shift state
cvt.encoding() Returns information about the external encoding
cvt.always_noconv() Returns true if no conversion will ever be done
cvt.length(s,fb,fe,max) Returns the number of externTs from the sequence between fb and fe to produce max internal characters
cvt.max_length() Returns the maximum number of externTs necessary to produce one internT

The function in() converts an external representation to an internal representation. The argument s is a reference to a stateT. At the beginning, this argument represents the shift state used when the conversion is started. At the end, the final shift state is stored there. The shift state passed in can differ from the initial state if the input buffer to be converted is not the first buffer being converted. The arguments fb (from begin) and fe (from end) are of type const externT*, and represent the beginning and the end of the input buffer. The arguments tb (to begin) and te (to end) are of type internT*, and represent the beginning and the end of the output buffer. The arguments fn (from next, of type const externT*&) and tn (to next, of type internT*&) are references used to return the end of the sequence converted in the input buffer and the output buffer respectively. Either buffer may reach the end before the other buffer reaches the end. The function returns a value of type codecvt_base::result, as indicated in Table 14.21.

Table 14.21. Return Values of the Conversion Functions

Value Meaning
ok All source characters were converted successfully
partial Not all source characters were converted, or more characters are needed to produce a destination character
error A source character was encountered that cannot be converted
noconv No conversion was necessary

If ok is returned, the function made some progress. If fn == fe holds, this means that the whole input buffer was processed and the sequence between tb and tn contains the result of the conversion. The characters in this sequence represent the characters from the input sequence, potentially with a finished character from a previous conversion. If the argument s passed to in() was not the initial state, a partial character from a previous conversion that was not completed could have been stored there.

If partial is returned, either the output buffer was full before the input buffer could be drained or the input buffer was drained when a character was not yet complete (for example, because the last byte in the input sequence was part of an escape sequence switching between shift states). If fe == fn, the input buffer was drained. In this case, the sequence between tb and tn contains all characters that were converted completely but the input sequence terminated with a partially converted character. The necessary information to complete this character's conversion during a subsequent conversion is stored in the shift state s. If fe != fn, the input buffer was not completely drained. In this case, te == tn holds; thus, the output buffer is full. The next time the conversion is continued, it should start with fn.

The return value noconv indicates a special situation. That is, no conversion was necessary to convert the external representation to the internal representation. In this case, fn is set to fb and tn is set to tb. Nothing is stored in the destination sequence because everything is already stored in the input sequence.

If error is returned, that means a source character that could not be converted was encountered. There are several reasons why this can happen. For example, the destination character set has no representation for a corresponding character, or the input sequence ends up with an illegal shift state. The C++ standard does not define any method that can be used to determine the cause of the error more precisely.
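The handling of the four return values can be sketched with the degenerate instantiation codecvt<char,char,mbstate_t>, which is available in every locale. The function name toInternal() is hypothetical, and reporting partial and error by throwing std::runtime_error is merely one possible choice for this illustration.

```cpp
#include <cwchar>       // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

// Converts an external character sequence to the internal representation
// by using the codecvt<char,char,mbstate_t> facet of loc.
std::string toInternal(const std::string& ext, const std::locale& loc)
{
    typedef std::codecvt<char, char, std::mbstate_t> Cvt;
    const Cvt& cvt = std::use_facet<Cvt>(loc);
    if (ext.empty()) {
        return std::string();
    }

    std::mbstate_t s = std::mbstate_t();    // initial shift state
    std::string result(ext.size(), '\0');   // output buffer

    const char* fb = ext.data();
    const char* fe = fb + ext.size();
    const char* fn = fb;
    char* tb = &result[0];
    char* te = tb + result.size();
    char* tn = tb;

    switch (cvt.in(s, fb, fe, fn, tb, te, tn)) {
      case std::codecvt_base::noconv:
        return ext;                         // nothing was written; input is usable as is
      case std::codecvt_base::ok:
        result.resize(tn - tb);             // keep only the converted part
        return result;
      case std::codecvt_base::partial:
        // output buffer full or input ended inside a character
        throw std::runtime_error("conversion incomplete");
      default:
        throw std::runtime_error("conversion error");
    }
}
```

For the classic locale, the facet performs no conversion at all, so the noconv branch is the one that is expected to be taken.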

The function out() is equivalent to the function in(), except that it converts in the opposite direction. That is, it converts an internal representation to an external representation. The meanings of the arguments and the values returned are the same; only the types of the arguments are swapped. That is, fb and fe now have the type const internT*, and tb and te now have the type externT*. The same applies to fn and tn.

The function unshift() inserts characters necessary to complete a sequence when the current state of the conversion is passed as the argument s. This normally means that a shift state is switched to the initial shift state. Only the external representation is terminated. Thus, the arguments tb and te are of type externT*, and tn is of type externT*&. The sequence between tb and te defines the output buffer in which the characters are stored. The end of the result sequence is stored in tn. unshift() returns a value as shown in Table 14.22.

Table 14.22. Return Values of the Function unshift()

Value Meaning
ok The sequence was completed successfully
partial More characters need to be stored to complete the sequence
error The state is invalid
noconv No character was needed to complete the sequence

The function encoding() returns some information about the encoding of the external representation. If encoding() returns −1, the conversion is state dependent. If encoding() returns 0, the number of externTs needed to produce an internal character is not constant. Otherwise, the number of externTs needed to produce an internT is returned. This information can be used to provide appropriate buffer sizes.

The function always_noconv() returns true if the functions in() and out() never perform a conversion. For example, the standard implementation of codecvt<char, char, mbstate_t> does no conversion, and thus, always_noconv() returns true for this facet. However, this only holds for the codecvt facet from the "C" locale. Other instances of this facet may actually do a conversion.

The function length() returns the number of externTs from the sequence between fb and fe necessary to produce max characters of type internT. If there are fewer than max complete internT characters in the sequence between fb and fe, the number of externTs used to produce a maximum number of internTs from the sequence is returned.
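The informational members can be queried as in the following sketch. The struct CodecvtInfo and the function queryCodecvt() are hypothetical names introduced only to collect the results.

```cpp
#include <cwchar>       // std::mbstate_t
#include <locale>

// Hypothetical helper type that collects properties of a codecvt facet.
struct CodecvtInfo {
    bool noconv;        // result of always_noconv()
    int  encoding;      // result of encoding()
    int  maxLength;     // result of max_length()
};

// Queries the codecvt<char,char,mbstate_t> facet of loc.
CodecvtInfo queryCodecvt(const std::locale& loc)
{
    typedef std::codecvt<char, char, std::mbstate_t> Cvt;
    const Cvt& cvt = std::use_facet<Cvt>(loc);

    CodecvtInfo info;
    info.noconv    = cvt.always_noconv();
    info.encoding  = cvt.encoding();
    info.maxLength = cvt.max_length();
    return info;
}
```

A caller could, for example, skip the conversion step entirely when noconv is true, or use maxLength to estimate the size of a buffer for the external representation.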

String Collation

The facet collate handles differences between conventions for the sorting of strings. For example, in German the letter "ü" is treated as being equivalent to the letter "u" or to the letters "ue" for the purpose of sorting strings. For other languages, this letter is not even a letter, and it is treated as a special character, when it is treated at all. Other languages use slightly different sorting rules for certain character sequences. The collate facet can be used to provide a sorting of strings that is familiar to the user. Table 14.23 lists the member functions of this facet. In this table, col is an object of an instantiation of collate, and the arguments passed to the functions are iterators that are used to define strings.

Table 14.23. Members of the collate<> Facet

Expression Meaning
col.compare(beg1,end1,beg2,end2) Returns 1 if the first string is greater than the second, 0 if both strings are equal, and −1 if the first string is smaller than the second
col.transform(beg,end) Returns a string to be compared with other transformed strings
col.hash(beg,end) Returns a hash value (of type long) for the string

The collate facet is a class template that takes a character type charT as its template argument. The strings passed to collate's members are specified using iterators of type const charT*. This is somewhat unfortunate because there is no guarantee that the iterators used by the type basic_string<charT> are also pointers. Thus, strings have to be compared using something like this:

    std::locale loc;
    std::string s1, s2;
    ...
    // get collate facet of the loc locale
    const std::collate<char>& col
        = std::use_facet<std::collate<char> >(loc);

    // compare strings by using the collate facet
    int result = col.compare(s1.data(), s1.data()+s1.size(),
                             s2.data(), s2.data()+s2.size());
    if (result == 0) {
        // s1 and s2 are equal
        ...
    }

The reason for this limitation is that you cannot predict which iterator types are necessary. It would be necessary to have collation facets for the pointer type and for an infinite number of iterator types.

Of course, here the special convenience function of locale can be used to compare strings (see page 703):

    bool result = loc(s1,s2);    // true if s1 is less than s2

But this works only for the compare() member function. There are no convenient functions defined by the C++ standard library for the other two members of collate.
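Because a locale object can be called with two strings, it can serve directly as a sorting criterion. A minimal sketch, using the hypothetical function name sortedFor():

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// Returns the strings of v sorted according to the collation rules of loc.
std::vector<std::string> sortedFor(std::vector<std::string> v,
                                   const std::locale& loc)
{
    // loc(s1,s2) yields true if s1 is less than s2 according to
    // the collate<char> facet of loc
    std::sort(v.begin(), v.end(), loc);
    return v;
}
```

With the classic locale, the resulting order is expected to be the plain order of the character values; other locales may sort the same strings differently.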

The transform() function returns an object of type basic_string<charT>. The lexicographical order of strings returned from transform() is the same as the order of the original strings using compare(). This ordering can be used for better performance if one string has to be compared with many other strings. Determining the lexicographical order of strings can be much faster than using compare(). This is because the national sorting rules can be relatively complex.
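One way to benefit from transform() is to compute the transformed string of each element once and then sort by these keys. The following sketch assumes a hypothetical function name sortByKey(); whether this is actually faster than sorting with compare() depends on the locale and the data.

```cpp
#include <algorithm>
#include <cstddef>
#include <locale>
#include <string>
#include <utility>
#include <vector>

// Sorts the strings of v according to loc by using precomputed
// collation keys; each string is transformed only once.
std::vector<std::string> sortByKey(const std::vector<std::string>& v,
                                   const std::locale& loc)
{
    const std::collate<char>& col
        = std::use_facet<std::collate<char> >(loc);

    // pair each string with its transformed key
    std::vector<std::pair<std::string, std::string> > keyed;
    for (std::size_t i = 0; i < v.size(); ++i) {
        const std::string& s = v[i];
        keyed.push_back(std::make_pair(
            col.transform(s.data(), s.data() + s.size()), s));
    }

    // ordinary lexicographical comparison of the keys
    std::sort(keyed.begin(), keyed.end());

    std::vector<std::string> result;
    for (std::size_t i = 0; i < keyed.size(); ++i) {
        result.push_back(keyed[i].second);
    }
    return result;
}
```

Strings with identical keys are ordered by their original contents here, which is a side effect of comparing pairs and not a requirement of the facet.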

The C++ standard library mandates support only for the two instantiations collate<char> and collate<wchar_t>. For other character types, users must write their own specializations, potentially using the standard instantiations.

Internationalized Messages

The messages facet is used to retrieve internationalized messages from a catalog of messages. This facet is intended primarily to provide a service similar to that of the function perror(). This function is used in POSIX systems to print a system error message for an error number stored in the global variable errno. Of course, the service provided by messages is more flexible. Unfortunately, it is not defined very precisely.

The messages facet is a template class that takes a character type charT as its template argument. The strings returned from this facet are of type basic_string<charT>. The basic use of this facet consists of opening a catalog, retrieving messages, and then closing the catalog. The class messages derives from a class messages_base, which defines a type catalog (actually, it is a type definition for int). An object of this type is used to identify the catalog on which the members of messages operate. Table 14.24 lists the member functions of the messages facet.

The name passed as the argument to the open() function identifies the catalog in which the message strings are stored. This can be, for example, the name of a file. The loc argument identifies a locale object that is used to access a ctype facet. This facet is used to convert the message to the desired character type.

The exact semantics of the get() member are not defined. An implementation for POSIX systems could, for example, return the string corresponding to the error message for error msgid, but this behavior is not required by the standard. The set argument is intended to create a substructure within the messages. For example, it might be used to distinguish between system errors and errors of the C++ standard library.

Table 14.24. Members of the messages<> Facet

Expression                    Meaning
msg.open(name,loc)            Opens a catalog and returns a corresponding ID
msg.get(cat,set,msgid,def)    Returns the message with ID msgid from catalog cat; if there is no such message, def is returned instead
msg.close(cat)                Closes the catalog

When a message catalog is no longer needed, it can be released using the close() function. Although the interface using open() and close() suggests that the messages are retrieved from a file as needed, this is by no means required. Actually, it is more likely that open() reads a file and stores the messages in memory. A later call to close() would then release this memory.
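The open, get, and close sequence might be wrapped as in the following sketch. The function names lookupMessage and demoLookupMessage and the catalog name are hypothetical. Which catalog names are valid, and how set and msgid are interpreted, is up to the implementation. The sketch assumes only that open() returns a negative value when no catalog can be opened.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Looks up message msgid of set in the named catalog and returns
// def if the catalog cannot be opened or has no such message.
std::string lookupMessage(const std::string& catalogName,
                          int set, int msgid,
                          const std::string& def,
                          const std::locale& loc = std::locale())
{
    typedef std::messages<char> Messages;
    const Messages& msg = std::use_facet<Messages>(loc);

    //open the catalog; a negative ID signals failure
    Messages::catalog cat = msg.open(catalogName, loc);
    if (cat < 0) {
        return def;
    }

    //retrieve the message and release the catalog again
    std::string text = msg.get(cat, set, msgid, def);
    msg.close(cat);
    return text;
}

// Usage sketch: the catalog name below is made up, so the default
// text is expected to come back.
void demoLookupMessage()
{
    std::string text = lookupMessage("hypothetical_missing_catalog", 0, 1,
                                     "unknown error",
                                     std::locale::classic());
    assert(text == "unknown error");
}
```

Opening and closing the catalog on every lookup keeps the sketch short. A program that retrieves many messages would rather keep the catalog open and close it once at the end.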

The standard requires that the two instantiations messages<char> and messages<wchar_t> be stored in each locale. The C++ standard library does not support any other instantiations.



[1] i18n is a common abbreviation for internationalization. It stands for the letter i, followed by 18 characters, followed by the letter n.

[2] Note that you have to put a space between the two ">" characters. ">>" would be parsed as a shift operator, which would result in a syntax error.

[3] POSIX and X/Open are standards for operating system interfaces.

[4] See http://www.josuttis.com/libbook/examples.html for a complete example program.

[5] This locale is identical to the global C++ locale only if the last call to locale::global() was with a named locale and if there was no call to setlocale() since then. Otherwise, the locale used by the C functions is different from the global C++ locale.
