As the global market has increased in importance, so has internationalization (or i18n for short)[1] become more important for software development. As a consequence, the C++ standard library provides concepts to write code for international programs. These concepts influence mainly the use of I/O and string processing. This chapter describes these concepts. Many thanks to Dietmar Kühl, who is an expert on I/O and internationalization in the C++ standard library and wrote major parts of this chapter.
The C++ standard library provides a general approach to support national conventions without being bound to specific conventions. This goes to the extent, for example, that strings are not bound to a specific character type to support 16-bit characters in Asia. For the internationalization of programs, two related aspects are important:
Different character sets have different properties. Handling them requires flexible solutions for problems, such as what is considered to be a letter or, worse, what type to use to represent characters. For character sets with more than 256 characters, type char is not sufficient as a representation.
The user of a program expects to see national or cultural conventions obeyed (for example, the formatting of dates, monetary values, numbers, and Boolean values).
For both aspects, the C++ standard library provides related solutions.
The major approach toward internationalization is to use locale objects to represent an extensible collection of aspects to be adapted to specific local conventions. Locales are already used in C to adapt to specific local conventions. In the C++ standard, this mechanism was generalized and made more flexible. Actually, the C++ locale mechanism can be used to address all kinds of customization, depending on the user's environment or preferences. For example, it can be extended to deal with measurement systems, time zones, or paper size.
Most of the mechanisms of internationalization involve no or only minimal additional work for the programmer. For example, when doing I/O with the C++ stream mechanism, numeric values are formatted according to the rules of some locale. The only work for the programmer is to instruct the I/O stream classes to use the user's preferences.
In addition to such automatic use, the programmer may use locale objects directly for formatting, collation, character classification, and so on. Some internationalized aspects supported by the C++ standard library are not used by the C++ standard library itself, and to use them the programmer has to call those functions manually. For example, there are no stream functions defined in the C++ standard library that do time, date, or monetary formatting. To use these services, it is necessary to call them directly (for example, in user-defined stream operators writing objects of a money class).
Strings and streams use another concept for internationalization: character traits. They define fundamental properties and operations that differ for different character sets, such as the value of "end-of-file" as well as functions to compare, assign, and copy strings.
The classes for internationalization were introduced to the standard relatively late. Although the general approach is extremely flexible, it still needs some work to make it really complete. For example, the functions for string collation (that is, comparing strings for sorting according to some locale conventions) use only iterators of type const charT*, where charT is some character type. Although it is very likely that basic_string<charT> uses this type as an iterator type, it is not at all guaranteed. Thus, it is not guaranteed that string iterators can be used as arguments to the functions for string collation. However, it is possible to use the result of the basic_string data() member function with the string collation functions.
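The use of data() with the collation functions can be sketched as follows. This is a minimal sketch, not a definitive implementation: the helper name collateCompare is hypothetical, and the classic C locale is assumed as the default because it is the only locale guaranteed to exist.

```cpp
#include <locale>
#include <string>

// Compares s1 and s2 according to the collation rules of loc.
// Returns a negative value, zero, or a positive value.
// (collateCompare is a hypothetical helper name used for illustration.)
int collateCompare(const std::string& s1, const std::string& s2,
                   const std::locale& loc = std::locale::classic())
{
    const std::collate<char>& coll = std::use_facet<std::collate<char> >(loc);

    // data() yields const char*, which is the iterator type
    // that collate<>::compare() expects
    return coll.compare(s1.data(), s1.data() + s1.size(),
                        s2.data(), s2.data() + s2.size());
}
```

A call such as collateCompare("abc", "abd") yields a negative value in the classic locale. With another locale passed as the third argument, the ordering may differ.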
One area internationalization addresses is how to handle different character encodings. This issue arises mainly in Asia, where different encodings are used to represent the same character set. The issue normally comes in conjunction with character encodings that use more than 8 bits. To process such characters, it is necessary to use new concepts and functions for text processing.
Two different approaches are common to address character sets that have more than 256 characters: multibyte representation and wide-character representation:
With multibyte representation, the number of bytes used for a character is variable. A 1-byte character, such as an ISO Latin-1 character, can be followed by a 3-byte character, such as a Japanese ideogram.
With wide-character representation, the number of bytes used to represent a character is always the same, independent of the character being represented. Typical representations use 2 or 4 bytes. Conceptually, this does not differ from representations that use just 1 byte for locales, where ISO Latin-1 or even ASCII is sufficient.
This multibyte representation is more compact than the wide-character representation. Thus, the multibyte representation is normally used to store data outside of programs. Conversely, it is much easier to process characters of fixed size, so the wide-character representation is usually used inside programs.
Like ISO C, ISO C++ uses the type wchar_t to represent wide characters. However, in C++, wchar_t is a keyword rather than a type definition. Thus, it is possible to overload all functions with this type.
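The following sketch illustrates what being a distinct type means in practice. The function name describe is a hypothetical one chosen for this illustration. The two overloads can coexist only because wchar_t is not merely another name for some integral type.

```cpp
#include <string>

// Two overloads that differ only in the character type.
// (describe is a hypothetical function used for illustration.)
std::string describe(char)    { return "narrow character"; }
std::string describe(wchar_t) { return "wide character"; }
```

Calling describe('a') selects the first overload, whereas describe(L'a') selects the second.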
In a multibyte string, the same byte may represent a character or even just a part of the character. During iteration through a multibyte string, each byte is interpreted according to a current "shift state." Depending on the value of the byte and the current shift state, a byte may represent a certain character or a change of the current shift state. A multibyte string always starts in some defined initial shift state. For example, in the initial shift state the bytes may represent ISO Latin-1 characters until an escape character is encountered. The character following the escape character identifies the new shift state. For example, that character may switch to a shift state in which the bytes are interpreted as Arabic characters until the next escape character is encountered.
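The notion of a shift state can be sketched with a toy encoding. Everything in this sketch is an assumption made for illustration: the escape byte, the two state names, and the rule that the byte 'A' after the escape byte selects the alternate state are invented here and do not correspond to any real encoding.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum ShiftState { Latin, Alternate };    // hypothetical shift states

const char ESC = '\033';                 // hypothetical escape byte

// Decodes a byte sequence of the toy encoding into (state, byte) pairs.
// A trailing escape byte without a successor is kept as an ordinary byte.
std::vector<std::pair<ShiftState, char> > decodeToy(const std::string& bytes)
{
    std::vector<std::pair<ShiftState, char> > result;
    ShiftState state = Latin;            // defined initial shift state

    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (bytes[i] == ESC && i + 1 < bytes.size()) {
            // the byte after the escape byte selects the new shift state
            ++i;
            state = (bytes[i] == 'A') ? Alternate : Latin;
        }
        else {
            // any other byte is a character in the current shift state
            result.push_back(std::make_pair(state, bytes[i]));
        }
    }
    return result;
}
```

The same byte value thus yields different results depending on what preceded it, which is why functions that process multibyte strings have to carry a state argument along.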
The class template codecvt<> (described in Section 14.4.4) is used to convert between different character encodings. This class is used mainly by the class basic_filebuf<> (see page 627) to convert between internal and external representations. The C++ standard actually makes no assumptions about multibyte character encodings, but it supports the notion of shift states. The members of the codecvt<> class support an argument that may be used to store an arbitrary state of a string. They also support a function intended to determine the character sequence used to return to the initial shift state.
The different representations of character sets imply variations that are relevant for the processing of strings and I/O. For example, the value used to represent "end-of-file" or the details of comparing characters may differ for representations.
The string and stream classes are intended to be instantiated with built-in types, especially with char and wchar_t. The interface of built-in types cannot be changed. Thus, the details on how to deal with aspects that depend on the representation are factored into a separate class, a so-called "traits class." Both the string and stream classes take a traits class as a template argument. This argument defaults to the class char_traits, parameterized with the template argument that defines the character type of the string or stream:

    namespace std {
        template<class charT,
                 class traits = char_traits<charT>,
                 class Allocator = allocator<charT> >
        class basic_string;
    }

    namespace std {
        template <class charT, class traits = char_traits<charT> >
        class basic_istream;

        template <class charT, class traits = char_traits<charT> >
        class basic_ostream;
        ...
    }

The character traits have type char_traits<>. This type is defined in <string> and is parameterized for the specific character type:

    namespace std {
        template <class charT>
        struct char_traits {
            ...
        };
    }

The traits classes define all fundamental properties of the character type and the corresponding operations necessary for the implementation of strings and streams as static components. Table 14.1 lists the members of char_traits.
The functions that process strings or character sequences are present for optimization only. They could also be implemented by using the functions that process single characters. For example, copy() can be implemented using assign(). However, there might be more efficient implementations when dealing with strings.
Note that counts used in the functions are exact counts, not maximum counts. That is, string termination characters within these sequences are ignored.
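The difference between an exact count and a maximum count can be sketched by contrasting char_traits<char>::compare() with the C function strncmp(). The two helper names below are hypothetical and serve only this comparison.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Compares exactly n characters; a '\0' inside the sequences
// is treated like any other character.
bool equalExact(const char* s1, const char* s2, std::size_t n)
{
    return std::char_traits<char>::compare(s1, s2, n) == 0;
}

// For contrast: strncmp() treats n as a maximum count
// and stops at the first string termination character.
bool equalUpToTerminator(const char* s1, const char* s2, std::size_t n)
{
    return std::strncmp(s1, s2, n) == 0;
}
```

For the three-character sequences "a\0b" and "a\0c", equalExact() reports a difference because the third characters are compared, whereas equalUpToTerminator() reports equality because the comparison ends at the embedded termination character.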
The last group of functions deals with the special processing of the character that represents end-of-file (EOF). This character extends the character set by an artificial character to indicate special processing. For some representations, the character type may be insufficient to accommodate this special character because it has to have a value that differs from the values of all "normal" characters of the character set. C established the convention to return a character as int instead of as char from functions reading characters. This technique was extended in C++. The character traits define char_type as the type to represent all characters, and int_type as the type to represent all characters plus EOF. The functions to_char_type(), to_int_type(), not_eof(), and eq_int_type() define the corresponding conversions and comparisons. It is possible that char_type and int_type are identical for some character traits. This can be the case if not all values of char_type are necessary to represent characters so that there is a spare value that can be used for end-of-file.
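How these conversions fit together can be sketched for char_traits<char>. The typedef Traits and the two helper functions are hypothetical names introduced for this sketch.

```cpp
#include <string>

typedef std::char_traits<char> Traits;   // hypothetical shorthand

// Returns true if i is the end-of-file value rather than a character.
bool isEndOfFile(Traits::int_type i)
{
    return Traits::eq_int_type(i, Traits::eof());
}

// Converts a character to int_type and back again.
char roundTrip(char c)
{
    Traits::int_type i = Traits::to_int_type(c);
    return Traits::to_char_type(i);
}
```

Because to_int_type() maps every char value to an int_type value that differs from eof(), even a character such as '\xFF' survives the round trip and is not mistaken for end-of-file.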
pos_type and off_type are used to define file positions and offsets (see page 634 for details).
Table 14.1. Character Traits Members
Expression | Meaning |
---|---|
char_type | The character type (that is, the template argument for char_traits) |
int_type | A type large enough to represent an additional, otherwise unused value for end-of-file |
pos_type | A type used to represent positions in streams |
off_type | A type used to represent offsets between positions in streams |
state_type | A type used to represent the current state in multibyte streams |
assign(c1,c2) | Assigns character c2 to c1 |
eq(c1,c2) | Returns whether the characters c1 and c2 are equal |
lt(c1,c2) | Returns whether character c1 is less than character c2 |
length(s) | Returns the length of the string s |
compare(s1,s2,n) | Compares up to n characters of strings s1 and s2 |
copy(s1,s2,n) | Copies n characters of string s2 to string s1 |
move(s1,s2,n) | Copies n characters of string s2 to string s1, where s1 and s2 may overlap |
assign(s,n,c) | Assigns the character c to n characters of string s |
find(s,n,c) | Returns a pointer to the first character in string s that is equal to c, or returns zero, if there is no such character among the first n characters |
eof() | Returns the value of end-of-file |
to_int_type(c) | Converts the character c into the corresponding representation as int_type |
to_char_type(i) | Converts the representation i as int_type to a character (the result of converting EOF is undefined) |
not_eof(i) | Returns the value i unless i is the value for EOF; in this case an implementation-dependent value different from EOF is returned |
eq_int_type(i1,i2) | Tests the equality of the two characters i1 and i2 represented as int_type (that is, the characters may be EOF) |
The C++ standard library provides specializations of char_traits<> for types char and wchar_t:

    namespace std {
        template<> struct char_traits<char>;
        template<> struct char_traits<wchar_t>;
    }

The specialization for char is usually implemented by using the global string functions of C that are defined in <cstring> or <string.h>. An implementation might look as follows:

    namespace std {
        template<> struct char_traits<char> {
            // type definitions:
            typedef char      char_type;
            typedef int       int_type;
            typedef streampos pos_type;
            typedef streamoff off_type;
            typedef mbstate_t state_type;

            // functions:
            static void assign(char& c1, const char& c2) {
                c1 = c2;
            }
            static bool eq(const char& c1, const char& c2) {
                return c1 == c2;
            }
            static bool lt(const char& c1, const char& c2) {
                return c1 < c2;
            }
            static size_t length(const char* s) {
                return strlen(s);
            }
            static int compare(const char* s1, const char* s2, size_t n) {
                return memcmp(s1,s2,n);
            }
            static char* copy(char* s1, const char* s2, size_t n) {
                return (char*)memcpy(s1,s2,n);
            }
            static char* move(char* s1, const char* s2, size_t n) {
                return (char*)memmove(s1,s2,n);
            }
            static char* assign(char* s, size_t n, char c) {
                return (char*)memset(s,c,n);
            }
            static const char* find(const char* s, size_t n, const char& c) {
                return (const char*)memchr(s,c,n);
            }
            static int eof() {
                return EOF;
            }
            static int to_int_type(const char& c) {
                return (int)(unsigned char)c;
            }
            static char to_char_type(const int& i) {
                return (char)i;
            }
            static int not_eof(const int& i) {
                return i!=EOF ? i : !EOF;
            }
            static bool eq_int_type(const int& i1, const int& i2) {
                return i1 == i2;
            }
        };
    }

See Section 11.2.14 for the implementation of a user-defined traits class that lets strings behave in a case-insensitive manner.
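As a rough idea of what such a traits class involves, the following sketch overrides only the comparison-related members and inherits everything else from char_traits<char>. The names IgnoreCaseTraits and icstring are hypothetical, and the sketch is not necessarily identical to the class presented in Section 11.2.14. It relies on std::toupper() and therefore on the global C locale.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical traits class: compares characters without regard to case.
struct IgnoreCaseTraits : public std::char_traits<char> {
    static char upper(char c) {
        return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    static bool eq(char c1, char c2) { return upper(c1) == upper(c2); }
    static bool lt(char c1, char c2) { return upper(c1) < upper(c2); }

    // compares exactly n characters of s1 and s2
    static int compare(const char* s1, const char* s2, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(s1[i], s2[i])) return -1;
            if (lt(s2[i], s1[i])) return 1;
        }
        return 0;
    }

    // searches c among the first n characters of s
    static const char* find(const char* s, std::size_t n, char c) {
        for (std::size_t i = 0; i < n; ++i) {
            if (eq(s[i], c)) return s + i;
        }
        return 0;
    }
};

// string type that uses the case-insensitive traits
typedef std::basic_string<char, IgnoreCaseTraits> icstring;
```

With these definitions, icstring("Hello") and icstring("hELLO") compare as equal. Note that icstring is a type different from std::string, so the usual stream operators and conversions are not available without additional work.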
One issue in conjunction with character encodings remains: How are special characters such as the newline or the string termination character internationalized? The class basic_ios has members widen() and narrow() that can be used for this purpose. Thus, the newline character in an encoding appropriate for the stream strm can be written as follows:
strm.widen('\n') // internationalized newline character
The string termination character in the same encoding can be created like this:
strm.widen('\0') // internationalized string termination character
See the implementation of the endl manipulator on page 613 for an example use.
The functions widen() and narrow() actually use a locale object, more precisely the ctype facet of this object. This facet can be used to convert all characters between char and some other character representations. It is described in Section 14.4.4. For example, the following expression converts the character c of type char into an object of type char_type by using the locale object loc[2]:
std::use_facet<std::ctype<char_type> >(loc).widen(c)
The details of the use of locales and their facets are described in the following sections.
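The conversion through the ctype facet can be wrapped in small function templates. This is only a sketch: the names widenChar and narrowChar are hypothetical, and the classic C locale is assumed as the default argument.

```cpp
#include <locale>

// Converts the char c into the character type charT using the ctype facet of loc.
// (widenChar is a hypothetical helper name.)
template <class charT>
charT widenChar(char c, const std::locale& loc = std::locale::classic())
{
    return std::use_facet<std::ctype<charT> >(loc).widen(c);
}

// Converts c back to char; fallback is returned if c has no char representation.
// (narrowChar is a hypothetical helper name.)
template <class charT>
char narrowChar(charT c, char fallback,
                const std::locale& loc = std::locale::classic())
{
    return std::use_facet<std::ctype<charT> >(loc).narrow(c, fallback);
}
```

For example, widenChar<wchar_t>('\n') yields the wide newline character, which is essentially what a stream does when widen('\n') is called on it.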
A common approach to internationalization is to use environments, called locales, to encapsulate national or cultural conventions. The C community uses this approach. Thus, in the context of internationalization, a locale is a collection of parameters and functions used to support national or cultural conventions. According to X/Open conventions,[3] the environment variable LANG is used to define the locale to be used. Depending on this locale, different formats for floating-point numbers, dates, monetary values, and so on are used.
The format of the string defining a locale is normally this:
language [_area [.code]]
language represents the language, such as English or German; area is the area, country, or culture where this language is used. It is used, for example, to support different national conventions even if the same language is used in different nations. code defines the character encoding to be used. This is mainly important in Asia, where different character encodings are used to represent the same character set.
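The structure of such a name can be made concrete with a small parser. The type LocaleNameParts and the function splitLocaleName are hypothetical helpers for illustration only; they are not part of the C++ standard library, and real locale names may deviate from this format. The sketch is also more lenient than the format above because it accepts a code without an area.

```cpp
#include <string>

// Hypothetical helper type holding the three parts of a locale name.
struct LocaleNameParts {
    std::string language;
    std::string area;
    std::string code;
};

// Splits a name of the form language[_area[.code]] into its parts.
LocaleNameParts splitLocaleName(const std::string& name)
{
    LocaleNameParts parts;
    std::string rest = name;

    // the optional encoding follows the first '.'
    std::string::size_type dot = rest.find('.');
    if (dot != std::string::npos) {
        parts.code = rest.substr(dot + 1);
        rest.erase(dot);
    }

    // the optional area follows the first '_'
    std::string::size_type underscore = rest.find('_');
    if (underscore != std::string::npos) {
        parts.area = rest.substr(underscore + 1);
        rest.erase(underscore);
    }

    parts.language = rest;
    return parts;
}
```

For the name de_DE.88591, this yields the language de, the area DE, and the code 88591. For a name without optional parts, such as C, only the language is set.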
Table 14.2 presents a selection of typical language strings. However, note that these strings are not yet standardized. For example, sometimes the first character of language is capitalized. Some implementations deviate from the format mentioned previously and, for example, use english to select an English locale. All in all, the locales that are supported by a system are implementation specific.
For programs, it is normally no problem that these names are not standardized! This is because the locale information is provided by the user in some form. It is common that programs simply read environment variables or some similar database to determine which locales to use. Thus, the burden of finding the correct locale names is put on the users. Only if the program always uses a special locale does the name need to be hard coded in the program. Normally, for this case, the C locale is sufficient, and is guaranteed to be supported by all implementations and to have the name C.
The next section presents the use of different locales in C++ programs. In particular, it introduces facets of locales that are used to deal with specific formatting details.
C also provides an approach to handle the problem of character sets with more than 256 characters. This approach is to use the character type wchar_t, a type definition for one of the integral types with language support for wide-character constants and wide-character string literals. However, apart from this, only functions to convert between wide characters and narrow characters are supported. This approach was also incorporated into C++ with the character type wchar_t, which is, unlike the C approach, a distinct type in C++. However, C++ provides more library support than C, because basically everything available for char is also available for wchar_t, and any other type may be used as a character type.
Table 14.2. Selection of Locale Names
Locale | Meaning |
---|---|
C | Default: ANSI-C conventions (English, 7 bit) |
de_DE | German in Germany |
de_DE.88591 | German in Germany with ISO Latin-1 encoding |
de_AT | German in Austria |
de_CH | German in Switzerland |
en_US | English in the United States |
en_GB | English in Great Britain |
en_AU | English in Australia |
en_CA | English in Canada |
fr_FR | French in France |
fr_CH | French in Switzerland |
fr_CA | French in Canada |
ja_JP.jis | Japanese in Japan with Japanese Industrial Standard (JIS) encoding |
ja_JP.sjis | Japanese in Japan with Shift JIS encoding |
ja_JP.ujis | Japanese in Japan with UNIXized JIS encoding |
ja_JP.EUC | Japanese in Japan with Extended UNIX Code encoding |
ko_KR | Korean in Korea |
zh_CN | Chinese in China |
zh_TW | Chinese in Taiwan |
lt_LN.bit7 | ISO Latin, 7 bit |
lt_LN.bit8 | ISO Latin, 8 bit |
POSIX | POSIX conventions (English, 7 bit) |
Using translations of textual messages is normally not sufficient for true internationalization. For example, different conventions for numeric, monetary, or date formatting also have to be used. In addition, functions manipulating letters should depend on character encoding to ensure the correct handling of all characters that are letters in a given language.
According to the POSIX and X/Open standards, it is already possible in C programs to set a locale. This is done using the function setlocale(). Changing the locale influences the results of character classification and manipulation functions, such as isupper() and toupper(), and the I/O functions, such as printf().
However, the C approach has several limitations. Because the locale is a global property, using more than one locale at the same time (for example, when reading floating-point numbers in English and writing them in German) is either not possible or is possible only with a relatively large effort. Also, locales cannot be extended. They provide only the facilities the implementation chooses to provide. If something the C locales do not provide must also be adapted to national conventions, a different mechanism has to be used to do this. Finally, it is not possible to define new locales to support special cultural conventions.
The C++ standard library addresses all of these problems with an object-oriented approach. First, the details of a locale are encapsulated in an object of type locale. Doing this immediately provides the possibility of using multiple locales at the same time. Operations that depend on locales are configured to use a corresponding locale object. For example, a locale object can be installed for each I/O stream, which is then used by the different member functions to adapt to the corresponding conventions. This is demonstrated by the following example:

    // i18n/loc1.cpp
    #include <iostream>
    #include <locale>
    using namespace std;

    int main()
    {
        // use classic C locale to read data from standard input
        cin.imbue(locale::classic());

        // use a German locale to write data to standard output
        cout.imbue(locale("de_DE"));

        // read and output floating-point values in a loop
        double value;
        while (cin >> value) {
            cout << value << endl;
        }
    }

The statement
cin.imbue(locale::classic());
assigns the "classic" C locale to the standard input channel. For the classic C locale, formatting of numbers and dates, character classification, and so on is handled as it is in original C without any locales. The expression
std::locale::classic()
obtains a corresponding object of class locale.
Using the expression
std::locale("C")
instead would yield the same result. This last expression constructs a locale object from a given name. The name "C" is a special name, and actually is the only one a C++ implementation is required to support. There is no requirement to support any other locale, although it is assumed that C++ implementations also support other locales.
Correspondingly, the statement
cout.imbue(locale("de_DE"));
assigns the locale de_DE to the standard output channel. This is, of course, successful only if the system supports this locale. If the name used to construct a locale object is unknown to the implementation, an exception of type runtime_error is thrown.
If everything was successful, input is read according to the classic C conventions and output is written according to the German conventions. The loop thus reads floating-point values in the normal English format, for example
47.11
and prints them using the German format, for example
47,11
Yes, the Germans really use a comma as a "decimal point".
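Because a locale named de_DE is not available on every system, the effect can also be sketched without it. The following example replaces the German locale with a hypothetical facet, CommaDecimalPoint, that changes only the decimal point of the classic locale. Both the facet and the function toCommaFormat are assumptions made for this illustration. A real German locale may differ in further details, such as the grouping of digits.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical stand-in for a German locale:
// only the decimal point differs from the classic C locale.
class CommaDecimalPoint : public std::numpunct<char> {
  protected:
    virtual char do_decimal_point() const { return ','; }
};

// Reads a floating-point value in classic C format and returns it
// formatted with a comma as the decimal point.
// Returns an empty string if the input is not a number.
std::string toCommaFormat(const std::string& input)
{
    std::istringstream in(input);
    in.imbue(std::locale::classic());

    std::ostringstream out;
    out.imbue(std::locale(std::locale::classic(), new CommaDecimalPoint));

    double value;
    if (in >> value) {
        out << value;
    }
    return out.str();
}
```

The input stream and the output stream use different locales at the same time, which is the point of the example: toCommaFormat("47.11") returns the string 47,11.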
Normally, a program does not predefine a specific locale except when writing and reading data in a fixed format. Instead, the locale is determined using the environment variable LANG. Another possibility is to read the name of the locale to be used. The following program demonstrates this:

    // i18n/loc2.cpp
    #include <iostream>
    #include <locale>
    #include <string>
    #include <cstdlib>
    using namespace std;

    int main()
    {
        // create the default locale from the user's environment
        locale langLocale("");

        // and assign it to the standard output channel
        cout.imbue(langLocale);

        // process the name of the locale
        bool isGerman;
        if (langLocale.name() == "de_DE" ||
            langLocale.name() == "de" ||
            langLocale.name() == "german") {
            isGerman = true;
        }
        else {
            isGerman = false;
        }

        // read locale for the input
        if (isGerman) {
            cout << "Sprachumgebung fuer Eingaben: ";
        }
        else {
            cout << "Locale for input: ";
        }
        string s;
        cin >> s;
        if (!cin) {
            if (isGerman) {
                cerr << "FEHLER beim Einlesen der Sprachumgebung" << endl;
            }
            else {
                cerr << "ERROR while reading the locale" << endl;
            }
            return EXIT_FAILURE;
        }
        locale cinLocale(s.c_str());

        // and assign it to the standard input channel
        cin.imbue(cinLocale);

        // read and output floating-point values in a loop
        double value;
        while (cin >> value) {
            cout << value << endl;
        }
    }

In this example, the following statement creates an object of the class locale:
locale langLocale("");
Passing an empty string as the name of the locale has a special meaning: The default locale from the user's environment is used (this is often determined by the environment variable LANG). This locale is assigned to the standard output stream with the statement
cout.imbue(langLocale);
The expression
langLocale.name()
is used to retrieve the name of the default locale, which is returned as an object of type string (see Chapter 11).
The following statements construct a locale from a name read from standard input:

    string s;
    cin >> s;
    ...
    locale cinLocale(s.c_str());

To do this, a word is read from the standard input and used as the constructor's argument. If the read fails, the ios_base::failbit is set in the input stream, which is checked and handled in this program:

    if (!cin) {
        if (isGerman) {
            cerr << "FEHLER beim Einlesen der Sprachumgebung" << endl;
        }
        else {
            cerr << "ERROR while reading the locale" << endl;
        }
        return EXIT_FAILURE;
    }

Again, if the string is not a valid value for the construction of a locale, a runtime_error exception is thrown.
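A program that should not terminate on an unknown name can catch this exception. The following sketch falls back to the classic C locale in that case. The function name localeOrClassic is hypothetical, and falling back silently is only one possible policy; a real program might prefer to report the problem to the user.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Tries to construct the locale with the passed name.
// If the name is unknown to the implementation, the classic C locale is returned.
// (localeOrClassic is a hypothetical helper name.)
std::locale localeOrClassic(const std::string& name)
{
    try {
        return std::locale(name.c_str());
    }
    catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

A call such as std::cin.imbue(localeOrClassic(s)) then always installs a valid locale, whatever the user entered.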
If a program wants to honor local conventions, it should use corresponding locale objects. The static member function global() of the class locale can be used to install a global locale object. This object is used as the default value for functions that take an optional locale object as an argument. If the locale object set with the global() function has a name, it is also arranged that the C functions dealing with locales react correspondingly. If the locale set has no name, the consequences for the C functions depend on the implementation.
Here is an example of how to set the global locale object depending on the environment in which the program is running:

    /* create a locale object depending on the program's environment and
     * set it as the global object
     */
    std::locale::global(std::locale(""));

Among other things, this arranges for the corresponding registration for the C functions to be executed. That is, the C functions are influenced as if the following call was made:
std::setlocale(LC_ALL,"")
However, setting the global locale does not replace locales already stored in objects. It only modifies the locale object copied when a locale is created with a default constructor. For example, the stream objects store locale objects that are not replaced by a call to locale::global(). If you want an existing stream to use a specific locale, you have to tell the stream to use this locale using the imbue() function.
The global locale is used if a locale object is created with the default constructor. In this case, the new locale behaves as if it is a copy of the global locale at the time it was constructed. The following three lines install the default locale for the standard streams:
// register global locale object for streams
std::cin.imbue(std::locale());
std::cout.imbue(std::locale());
std::cerr.imbue(std::locale());
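That existing streams keep their locale can be observed with two string streams, one created before and one created after the global locale is replaced. In this sketch the hypothetical facet CommaDecimalPoint merely makes the two locales distinguishable; the function name is also an assumption. The previous global locale is restored before the function returns.

```cpp
#include <locale>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical facet that makes a locale distinguishable from the classic one.
class CommaDecimalPoint : public std::numpunct<char> {
  protected:
    virtual char do_decimal_point() const { return ','; }
};

// Returns the output of a stream created before the global locale is replaced
// (first) and of a stream created afterward (second).
std::pair<std::string, std::string> formatBeforeAndAfterGlobalChange(double value)
{
    std::ostringstream existing;     // stores a copy of the current global locale

    std::locale custom(std::locale::classic(), new CommaDecimalPoint);
    std::locale previous = std::locale::global(custom);

    std::ostringstream created;      // picks up the new global locale

    existing << value;               // still uses the locale stored at creation
    created << value;

    std::locale::global(previous);   // restore the previous global locale
    return std::make_pair(existing.str(), created.str());
}
```

Assuming the global locale is the classic one when the function is called, the value 1.5 is written as 1.5 by the existing stream and as 1,5 by the newly created stream. Calling existing.imbue(std::locale()) after the change would make the first stream follow as well.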
When using locales in C++, it is important to remember that the C++ locale mechanism is only loosely coupled to the C locale mechanism. There is only one relation to the C locale mechanism: The global C locale is modified if a named C++ locale object is set as the global locale. In general, you should not assume that the C and the C++ functions operate on the same locales.
The actual dependencies on national conventions are separated into several aspects that are handled by corresponding objects. An object dealing with a specific aspect of internationalization is called a facet. A locale object is used as a container of different facets. To access an aspect of a locale, the type of the corresponding facet is used as the index. The type of the facet is passed explicitly as a template argument to the template function use_facet(), accessing the desired facet. For example, the expression
std::use_facet<std::numpunct<char> >(loc)
accesses the facet type numpunct for the character type char of the locale object loc. Each facet type is defined by a class that defines certain services. For example, the facet type numpunct provides services used in conjunction with the formatting of numeric and Boolean values. For example, the following expression returns the string used to represent true in the locale loc:
std::use_facet<std::numpunct<char> >(loc).truename()
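Wrapped into functions, the access to the numpunct facet might look like the following sketch. The function names boolName and decimalPoint are hypothetical, and the classic C locale is assumed as the default.

```cpp
#include <locale>
#include <string>

// Returns the textual representation of value in the locale loc.
// (boolName is a hypothetical helper name.)
std::string boolName(bool value,
                     const std::locale& loc = std::locale::classic())
{
    const std::numpunct<char>& np = std::use_facet<std::numpunct<char> >(loc);
    return value ? np.truename() : np.falsename();
}

// Returns the character used as decimal point in the locale loc.
// (decimalPoint is a hypothetical helper name.)
char decimalPoint(const std::locale& loc = std::locale::classic())
{
    return std::use_facet<std::numpunct<char> >(loc).decimal_point();
}
```

For the classic locale, boolName(true) returns the string true and decimalPoint() returns the character '.'.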
Table 14.3 provides an overview over the facets predefined by the C++ standard library. Each facet is associated with a category. These categories are used by some of the constructors of locales to create new locales as the combination of other locales.
Table 14.3. Facet Types Predefined by the C++ Standard Library
Category | Facet Type | Used for |
---|---|---|
numeric | num_get<>() | Numeric input |
numeric | num_put<>() | Numeric output |
numeric | numpunct<>() | Symbols used for numeric I/O |
time | time_get<>() | Time and date input |
time | time_put<>() | Time and date output |
monetary | money_get<>() | Monetary input |
monetary | money_put<>() | Monetary output |
monetary | moneypunct<>() | Symbols used for monetary I/O |
ctype | ctype<>() | Character information (toupper(), isupper()) |
ctype | codecvt<>() | Conversion between different character encodings |
collate | collate<>() | String collation |
messages | messages<>() | Message string retrieval |
It is possible to define your own versions of the facets to create specialized locales. The following example demonstrates how this is done. It defines a facet using German representations of the Boolean values:

    class germanBoolNames : public std::numpunct_byname<char> {
      public:
        germanBoolNames (const char *name)
          : std::numpunct_byname<char>(name) {
        }
      protected:
        virtual std::string do_truename() const {
            return "wahr";
        }
        virtual std::string do_falsename() const {
            return "falsch";
        }
    };

The class germanBoolNames derives from the class numpunct_byname, which is defined by the C++ standard library. This class defines punctuation properties depending on the locale used for numeric formatting. Deriving from numpunct_byname instead of from numpunct means that the members not overridden explicitly still adapt to a locale: The values returned from these members depend on the name used as the argument to the constructor. If the class numpunct had been used as the base class, the behavior of the other functions would be fixed. However, the class germanBoolNames overrides the two functions used to determine the textual representation of true and false.
To use this facet in a locale, you need to create a new locale using a special constructor of the class locale. This constructor takes a locale object as its first argument and a pointer to a facet as its second argument. The created locale is identical to the first argument except for the facet that is passed as the second argument. This facet is installed in the newly created locale after the first argument is copied:
std::locale loc (std::locale(""), new germanBoolNames(""));
The new expression creates a facet that is installed in the new locale. Thus, it is registered in loc to create a variation of locale("").
Since locales are immutable, you have to create a new locale object if you want to install a new facet to a locale. This locale object can be used like any other locale object. For example,

    std::cout.imbue(loc);
    std::cout << std::boolalpha << true << std::endl;

would have the following output:
wahr
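Put together, the pieces form the following sketch. To stay independent of the user's environment, it deliberately deviates from the statements above in two ways: The facet is constructed with the name "C" instead of "", and the output goes to a string stream instead of std::cout. The function name formatBool is hypothetical.

```cpp
#include <ios>
#include <locale>
#include <sstream>
#include <string>

// facet with German representations of the Boolean values
class germanBoolNames : public std::numpunct_byname<char> {
  public:
    explicit germanBoolNames(const char* name)
      : std::numpunct_byname<char>(name) {
    }
  protected:
    virtual std::string do_truename() const { return "wahr"; }
    virtual std::string do_falsename() const { return "falsch"; }
};

// Formats value with a locale that has the facet germanBoolNames installed.
// (formatBool is a hypothetical helper name.)
std::string formatBool(bool value)
{
    // "C" is used because it is the only name guaranteed to be supported
    std::locale loc(std::locale::classic(), new germanBoolNames("C"));

    std::ostringstream out;
    out.imbue(loc);
    out << std::boolalpha << value;
    return out.str();
}
```

The facet is created with new and never deleted explicitly because the locale takes over the ownership of the facet it is constructed with.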
You also can create a completely new facet. In this case, the function has_facet() can be used to determine whether such a new facet is registered for a given locale object.
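A completely new facet might be sketched as follows. The facet PaperSize and the function paperSizeOf are invented for this illustration. What makes a class usable as a new facet type is that it derives from locale::facet and provides a static member id of type locale::id.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical facet that stores the name of the preferred paper size.
class PaperSize : public std::locale::facet {
  public:
    static std::locale::id id;    // identifies the facet type inside a locale

    explicit PaperSize(const std::string& n, std::size_t refs = 0)
      : std::locale::facet(refs), name_(n) {
    }
    std::string name() const { return name_; }
  private:
    std::string name_;
};

std::locale::id PaperSize::id;

// Returns the paper size registered in loc, or "unknown" if there is none.
std::string paperSizeOf(const std::locale& loc)
{
    if (std::has_facet<PaperSize>(loc)) {
        return std::use_facet<PaperSize>(loc).name();
    }
    return "unknown";
}
```

The classic locale does not contain the new facet, so paperSizeOf(std::locale::classic()) returns unknown. For a locale created with std::locale(std::locale::classic(), new PaperSize("A4")), the function returns A4. Without the check by has_facet(), the call of use_facet() would throw an exception for locales that lack the facet.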
A C++ locale is an immutable container for facets. It is defined in the <locale> header file as follows:

    namespace std {
        class locale {
          public:
            // global locale objects
            static const locale& classic();        // classic C locale
            static locale global(const locale&);   // set global locale

            // internal types and values
            class facet;
            class id;
            typedef int category;
            static const category none, numeric, time, monetary,
                                  ctype, collate, messages, all;

            // constructors
            locale() throw();
            explicit locale (const char* name);

            // create locale based on other locales
            locale (const locale& loc) throw();
            locale (const locale& loc, const char* name, category);
            template <class Facet>
              locale (const locale& loc, Facet* fp);
            locale (const locale& loc, const locale& loc2, category);

            // assignment operator
            const locale& operator= (const locale& loc) throw();
            template <class Facet>
              locale combine (const locale& loc);

            // destructor
            ~locale() throw();

            // name (if any)
            basic_string<char> name() const;

            // comparisons
            bool operator== (const locale& loc) const;
            bool operator!= (const locale& loc) const;

            // sorting of strings
            template <class charT, class Traits, class Allocator>
              bool operator() (
                const basic_string<charT,Traits,Allocator>& s1,
                const basic_string<charT,Traits,Allocator>& s2) const;
        };

        // facet access
        template <class Facet>
          const Facet& use_facet (const locale&);
        template <class Facet>
          bool has_facet (const locale&) throw();
    }

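The function call operator declared for the sorting of strings makes it possible to pass a locale object directly as the sorting criterion of an algorithm. The following sketch shows this with std::sort(). The function name sortedByLocale is hypothetical, and the classic C locale is assumed as the default.

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// Returns the passed words sorted according to the collation rules of loc.
// (sortedByLocale is a hypothetical helper name.)
std::vector<std::string> sortedByLocale(
        std::vector<std::string> words,
        const std::locale& loc = std::locale::classic())
{
    // the locale object itself serves as the comparison function object
    std::sort(words.begin(), words.end(), loc);
    return words;
}
```

In the classic locale, strings are ordered by their character values, so an uppercase ASCII letter sorts before any lowercase ASCII letter. Other locales may interleave uppercase and lowercase letters or handle accented characters differently.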
The strange thing about locales is how the objects stored in the container are accessed. A facet in a locale is accessed using the type of the facet as the index. Because each facet exposes a different interface and suits a different purpose, it is desirable to have the access function to locales return a type corresponding to the index. This is exactly what can be done with a type as the index. Using the facet's type as an index has the additional advantage of having a type-safe interface.
Locales are immutable. This means the facets stored in a locale cannot be changed (except when locales are being assigned). Variations of locales are created by combining existing locales and facets to create a new locale. Table 14.4 lists the constructors for locales.
Table 14.4. Constructing Locales
Expression | Effect |
---|---|
locale() | Creates a copy of the current global locale |
locale(name) | Creates a locale from the string name |
locale(loc) | Creates a copy of locale loc |
locale(loc1, loc2, cat) | Creates a copy of locale loc1, with all facets from category cat replaced with facets from locale loc2 |
locale(loc, name, cat) | Equivalent to locale(loc, locale(name), cat) |
locale(loc, fp) | Creates a copy of locale loc and installs the facet to which fp refers |
loc1 = loc2 | Assigns locale loc2 to locale loc1 |
loc1.template combine<F>(loc2) | Creates a copy of locale loc1 but with the facet of type F taken from loc2 |
Almost all constructors create a copy of some other locale. Merely copying a locale is considered to be a cheap operation. Basically, it consists of setting a pointer and increasing a reference count. Creating a modified locale is more expensive. In this case, a reference count for each facet stored in the locale has to be adjusted. Although the standard makes no guarantees about such efficient behavior, it is likely that all implementations will be rather efficient for copying locales.
Two of the constructors listed in Table 14.4 take names of locales. The names accepted are not standardized, with the exception of the name C. However, the standard requires that the documentation with the C++ standard library lists the accepted names. It is assumed that most implementations will accept names as outlined in Section 14.2.
The member function combine()
needs some explanation because it uses a feature that was implemented in compilers only recently. It is a member function template with an explicitly specified template argument. This means the template argument is not deduced implicitly from an argument because there is no argument from which the type can be deduced. Instead, the template argument is specified explicitly (type F in this case).
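The constructors of Table 14.4 and the member function combine() can be sketched as follows. This is only an illustration under stated assumptions: the facet class BoolNamesDE and the helper functions are hypothetical names, and the facet derives from numpunct<char> rather than from a _byname class so that the code does not depend on a named locale being installed on the platform.

```cpp
#include <locale>
#include <string>

// Hypothetical facet for illustration: German names for Boolean values
class BoolNamesDE : public std::numpunct<char> {
  protected:
    virtual std::string do_truename() const { return "wahr"; }
    virtual std::string do_falsename() const { return "falsch"; }
};

// locale(loc, fp): copy of the classic locale with the new facet installed
std::locale makeGermanBools() {
    return std::locale(std::locale::classic(), new BoolNamesDE);
}

// locale(loc1, loc2, cat): classic locale with all numeric facets taken from loc2
std::locale viaCategory(const std::locale& loc2) {
    return std::locale(std::locale::classic(), loc2, std::locale::numeric);
}

// loc1.combine<F>(loc2): classic locale with only the numpunct<char> facet taken from loc2
std::locale viaCombine(const std::locale& loc2) {
    return std::locale::classic().combine<std::numpunct<char> >(loc2);
}

// helper: textual representation of true according to loc
std::string trueName(const std::locale& loc) {
    return std::use_facet<std::numpunct<char> >(loc).truename();
}
```

Because the calls above are not made inside a template, the template keyword is not needed in front of combine.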
The two functions that access facets in a locale object use the same technique (Table 14.5). The major difference is that these two functions are global template functions, thereby making this ugly syntax involving the template
keyword unnecessary.
The function use_facet()
returns a reference to a facet. The type of this reference is the type passed explicitly as the template argument. If the locale passed as the argument does not contain a corresponding facet, the function throws a bad_cast
exception. The function has_facet()
can be used to test whether a particular facet is present in a given locale.
Table 14.5. Accessing Facets
Expression | Effect |
---|---|
has_facet<F>(loc) | Returns true if a facet of type F is stored in locale loc |
use_facet<F>(loc) | Returns a reference to the facet of type F stored in locale loc |
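As a small sketch of both access functions, the following helpers query the numpunct<char> facet, which every locale is required to contain. The function names boolNames() and decimalPoint() are hypothetical and chosen only for this example.

```cpp
#include <locale>
#include <string>

// Returns the Boolean names of loc in the form "true/false",
// or an empty string if loc has no numpunct<char> facet.
std::string boolNames(const std::locale& loc) {
    if (!std::has_facet<std::numpunct<char> >(loc)) {
        return std::string();
    }
    // the template argument selects the facet and the type of the result
    const std::numpunct<char>& np = std::use_facet<std::numpunct<char> >(loc);
    return np.truename() + "/" + np.falsename();
}

// Returns the decimal point used by loc.
char decimalPoint(const std::locale& loc) {
    return std::use_facet<std::numpunct<char> >(loc).decimal_point();
}
```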
The remaining operations of locales are listed in Table 14.6. The name of a locale is maintained if the locale was constructed from a name, or one or more named locales. However, again, the standard makes no guarantees about the construction of a name resulting from combining two locales. Two locales are considered to be identical if one is a copy of the other or if both locales have the same name. It is natural to consider two objects to be identical if one is a copy of the other. But what about this naming stuff? The idea behind this is basically that the name of the locale reflects the names used to construct the named facets. For example, the locale's name might be constructed by joining the names of the facets in a particular order, separating the individual names by separation characters. Using this scheme, it would be possible to identify two locale objects as identical if they are constructed by combining the same named facets into locale objects. In other words, the standard basically requires that two locales consisting of the same set of named facets be considered identical. Thus, the names will probably be constructed carefully to support this notion of equality.
Table 14.6. Operations of Locales
Expression | Effect |
---|---|
loc.name() | Returns the name of locale loc as string |
loc1 == loc2 | Returns true if loc1 and loc2 are identical locales |
loc1 != loc2 | Returns true if loc1 and loc2 are different locales |
loc(str1, str2) | Returns the Boolean result of comparing strings str1 and str2 for ordering (whether str1 is less than str2) |
locale::classic() | Returns locale("C") |
locale::global(loc) | Installs loc as the global locale and returns the previous global locale |
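Because locale::global() returns the previous global locale, it is easy to install a locale temporarily. The following is a minimal sketch; the class GlobalLocaleGuard is a hypothetical helper, not part of the library.

```cpp
#include <locale>

// Hypothetical helper: installs loc as the global locale and
// restores the previous global locale on destruction.
class GlobalLocaleGuard {
  public:
    explicit GlobalLocaleGuard(const std::locale& loc)
      : previous(std::locale::global(loc)) {   // global() returns the old locale
    }
    ~GlobalLocaleGuard() {
        std::locale::global(previous);
    }
  private:
    std::locale previous;
    // not copyable
    GlobalLocaleGuard(const GlobalLocaleGuard&);
    GlobalLocaleGuard& operator=(const GlobalLocaleGuard&);
};
```

While a guard exists, locale() yields a copy of the installed locale, so it compares equal to it. A locale that has no name, such as one created by installing an unnamed facet, reports the name "*".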
The parentheses operator makes it possible to use a locale object as a comparator for strings. This operator uses the string comparison from the collate
facet to compare the strings passed as the argument for ordering. Thus, it returns whether one string is less than the other string according to the locale object. This is the behavior of an STL function object (see Section 8.1), so you can use a locale object as a sorting criterion for STL algorithms that operate on strings. For example, a vector of strings can be sorted according to the rules for string collation of the German locale as follows:
std::vector<std::string> v;
...
// sort strings according to the German locale
std::sort (v.begin(), v.end(),        // range
           std::locale("de_DE"));     // sorting criterion
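Whether a locale named de_DE exists depends on the platform. As a sketch that runs everywhere, the same technique can be wrapped in a function that takes the locale as a parameter (the function name sortWith() is hypothetical); passing the classic locale sorts by the plain character values.

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// Sorts v according to the collate facet of loc.
// The locale object itself serves as the sorting criterion.
void sortWith(std::vector<std::string>& v, const std::locale& loc) {
    std::sort(v.begin(), v.end(), loc);
}
```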
The important aspect of locales is the facets they contain. All locales are guaranteed to contain certain standard facets. The description of the individual facets in the following subsections states which instantiations of the corresponding facet are guaranteed. In addition to these facets, an implementation of the C++ standard library may provide additional facets in the locales. What is important is that the user can also install her own facets or replace standard ones.
Section 14.2.2 discussed how to install a facet in a locale. For example, the class germanBoolNames
was derived from the class numpunct_byname<char>,
one of the standard facets, and installed in a locale using the constructor, taking a locale and a facet as arguments. But what do you need to create your own facet? Every class F that conforms to the following two requirements can be used as a facet:
F derives publicly from class locale::facet. This base class mainly defines some mechanism for reference counting that is used internally by the locale objects. It also declares the copy constructor and the assignment operator to be private, thereby making it infeasible to copy or to assign facets.
F has a publicly accessible static member named id of type locale::id. This member is used to look up a facet in a locale using the facet's type. The whole issue of using a type as the index is to have a type-safe interface. Internally, a normal container with an integer as the index is used to maintain the facets.
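A facet that meets both requirements might look like the following sketch. The class PaperSize and the two helper functions are hypothetical and serve only as an illustration.

```cpp
#include <cstddef>
#include <locale>
#include <string>
#include <typeinfo>

// Hypothetical facet that stores the name of a paper size
class PaperSize : public std::locale::facet {     // requirement 1: public base class
  public:
    static std::locale::id id;                    // requirement 2: static member id

    explicit PaperSize(const std::string& n, std::size_t refs = 0)
      : std::locale::facet(refs), sizeName(n) {
    }
    std::string name() const {                    // const, so callable via use_facet()
        return sizeName;
    }
  private:
    std::string sizeName;
};

std::locale::id PaperSize::id;                    // definition of the static member

// Returns the paper size of loc, or "unspecified" if no such facet is installed.
std::string paperSizeOf(const std::locale& loc) {
    if (std::has_facet<PaperSize>(loc)) {
        return std::use_facet<PaperSize>(loc).name();
    }
    return "unspecified";
}

// Returns true if use_facet() throws bad_cast for loc.
bool useFacetThrows(const std::locale& loc) {
    try {
        std::use_facet<PaperSize>(loc);
    }
    catch (const std::bad_cast&) {
        return true;
    }
    return false;
}
```

The facet is installed like any other facet, for example with std::locale(std::locale::classic(), new PaperSize("A4")).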
The standard facets conform not only to these requirements but also to some special implementation guidelines. Although conforming to these guidelines is not required, doing so is useful. The guidelines are as follows:
All member functions are declared to be const.
This is useful because use_facet()
returns a reference to a const
facet. Member functions that are not declared to be const
can't be invoked.
All public functions are nonvirtual and delegate each request to a protected virtual function. The protected function is named like the public one, with the addition of a leading do_.
For example, numpunct::truename()
calls numpunct::do_truename().
This style is used to avoid hiding member functions when overriding only one of several virtual member functions that has the same name. For example, the class num_put
has several functions named put().
In addition, it gives the programmer of the base class the possibility of adding some extra code in the nonvirtual functions, which is executed even if the virtual function is overridden.
The following description of the standard facets concerns only the public functions. To modify a facet, you always have to override the corresponding protected functions. If you define functions with the same interface as the public facet functions, they only overload them because these functions are not virtual.
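The following sketch overrides one protected do_ function of numpunct<char> and leaves everything else untouched. The class CommaDecimalPoint and the function formatWithComma() are hypothetical names used only here.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: like numpunct<char>, but with a comma as the decimal point
class CommaDecimalPoint : public std::numpunct<char> {
  protected:
    // override the protected virtual function, not the public decimal_point()
    virtual char do_decimal_point() const {
        return ',';
    }
};

// Formats value with a stream whose locale contains the modified facet.
std::string formatWithComma(double value) {
    std::ostringstream os;
    os.imbue(std::locale(std::locale::classic(), new CommaDecimalPoint));
    os << value;
    return os.str();
}
```

The stream never calls do_decimal_point() directly. It calls the public function decimal_point(), which delegates to the overridden function.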
For most standard facets, a "_byname"
version is defined. This version derives from the standard facet and is used to create an instantiation for a corresponding locale name. For example, the class numpunct_byname
is used to create the numpunct
facet for a named locale. For example, a German numpunct
facet can be created like this:
std::numpunct_byname<char>("de_DE")
The _byname
classes are used internally by the locale constructors that take a name as an argument. For each of the standard facets supporting a name, the corresponding _byname
class is used to construct an instance of the facet.
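Because the accepted names are not standardized, code that creates a _byname facet directly should be prepared for an unknown name. The following sketch assumes that the constructor reports an unknown name with a runtime_error exception, as the locale constructors do; this is the behavior of common implementations, not something this chapter guarantees. The function name decimalPointOf() is hypothetical.

```cpp
#include <locale>
#include <stdexcept>

// Returns the decimal point of the numpunct facet for the locale called name,
// or fallback if the facet cannot be created for that name (assumption: the
// constructor throws std::runtime_error in that case).
char decimalPointOf(const char* name, char fallback) {
    try {
        std::locale loc(std::locale::classic(),
                        new std::numpunct_byname<char>(name));
        return std::use_facet<std::numpunct<char> >(loc).decimal_point();
    }
    catch (const std::runtime_error&) {
        return fallback;
    }
}
```

For example, decimalPointOf("de_DE", '.') is likely to yield a comma on systems that provide a German locale under this name.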
Numeric formatting converts between the internal representation of numbers and the corresponding textual representations. The iostream operators delegate the actual conversion to the facets of the locale::numeric
category. This category is formed by three facets:
numpunct,
which handles punctuation symbols used for numeric formatting and parsing
num_put,
which handles numeric formatting
num_get,
which handles numeric parsing
In short, the facet num_put
does the numeric formatting described for iostreams in Section 13.7, and num_get
parses the corresponding strings. Additional flexibility not directly accessible through the interface of the streams is provided by the numpunct
facet.
The numpunct
facet controls the symbol used as the decimal point, the insertion of optional thousands separators, and the strings used for the textual representation of Boolean values. Table 14.7 lists the members of numpunct.
Table 14.7. Members of the numpunct
Facet
Expression | Meaning |
---|---|
np.decimal_point() | Returns the character used as the decimal point |
np.thousands_sep() | Returns the character used as the thousands separator |
np.grouping() | Returns a string describing the positions of the thousands separators |
np.truename() | Returns the textual representation of true |
np.falsename() | Returns the textual representation of false |
numpunct
takes a character type charT
as the template argument. The characters returned from decimal_point()
and thousands_sep()
are of this type, and the functions truename()
and falsename()
return a basic_string<charT>.
The two instantiations numpunct<char>
and numpunct<wchar_t>
are required.
Because long numbers are hard to read without intervening characters, the standard facets for numeric formatting and numeric parsing support thousands separators. Often, the digits representing an integer are grouped into triples. For example, one million is written like this:
1,000,000
Unfortunately, it is not used everywhere exactly like that. For example, in German a period is used instead of a comma. Thus, a German would write one million like this:
1.000.000
This difference is covered by the thousands_sep()
member. But this is not sufficient because in some countries digits are not put into triples. For example, in Nepal people would write
10.00.000
using even different numbers of digits in the groups. This is where the string returned from the function grouping()
comes in. The number stored at index i gives the number of digits in the ith group, where counting starts with zero for the rightmost group. If there are fewer characters in the string than groups, the size of the last specified group is repeated. To create unlimited groups, you can use the value numeric_limits<char>::max()
or, if there is no group at all, the empty string. Table 14.8 lists some examples of the formatting of one million.
Table 14.8. Examples of Numeric Punctuation of One Million
String | Result |
---|---|
{ 0 } or "" (the default for grouping()) | 1000000 |
{ 3, 0 } or "\3" | 1,000,000 |
{ 3, 2, 3, 0 } or "\3\2\3" | 10,00,000 |
{ 2, CHAR_MAX, 0 } | 10000,00 |
Note that normal digits are usually not very useful. For example, the string "2"
specifies groups of 50 digits for ASCII encoding because the character '2'
has the integer value 50
in the ASCII character set.
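The effect of the grouping string can be tried out with a facet that returns a string supplied by the caller. This is a sketch only: the class Grouping and the function formatGrouped() are hypothetical, and a comma is used as the thousands separator.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: numpunct<char> with a caller-supplied grouping string
class Grouping : public std::numpunct<char> {
  public:
    explicit Grouping(const std::string& g) : groups(g) {
    }
  protected:
    virtual char do_thousands_sep() const {
        return ',';
    }
    virtual std::string do_grouping() const {
        return groups;       // each character is a group size, rightmost group first
    }
  private:
    std::string groups;
};

// Formats value using the grouping string groups.
std::string formatGrouped(long value, const std::string& groups) {
    std::ostringstream os;
    os.imbue(std::locale(std::locale::classic(), new Grouping(groups)));
    os << value;
    return os.str();
}
```

Note that the group sizes are written as escape sequences such as "\3", which yields a character with the value 3, and not as the digit "3".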
The num_put
facet is used for textual formatting of numbers. It is a template class that takes two template arguments: the type charT
of the characters to be produced and the type OutIt
of an output iterator to the location at which the produced characters are written. The output iterator defaults to ostreambuf_iterator<charT>.
The num_put
facet provides a set of functions, all called put()
and differing only in the last argument. You can use the facet as follows:
std::locale loc;
OutIt to = ...;
std::ios_base& fmt = ...;
charT fill = ...;
T value = ...;

// get numeric output facet of the loc locale
const std::num_put<charT,OutIt>& np
    = std::use_facet<std::num_put<charT,OutIt> >(loc);

// write value with numeric output facet
np.put(to, fmt, fill, value);
These statements would produce a textual representation of the value value
using characters of type charT
written to the output iterator to.
The exact format is determined from the formatting flags stored in fmt,
where the character fill
is used as a fill character. The put()
function returns an iterator pointing immediately after the last character written.
The facet num_put
provides member functions that take objects of types bool, long, unsigned long, double, long double,
and void*
as the last argument. It does not provide member functions, for example, for short
or int.
This is no problem because corresponding values of built-in types are promoted to supported types if necessary.
The standard requires that the two instantiations num_put<char>
and num_put<wchar_t>
are stored in each locale (both using the default for the second template argument). In addition, the C++ standard library supports all instantiations that take a character type as the first template argument and an output iterator type as the second. Of course, it is not required that all of these instantiations are stored in each locale because this would be an infinite amount of facets.
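A concrete use of num_put<char> with its default output iterator might look like the following sketch. The function name putNumber() is hypothetical; a string stream supplies both the stream buffer for the iterator and the ios_base object that holds the formatting flags.

```cpp
#include <ios>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Formats value with the num_put facet of the classic locale,
// using the formatting flags passed as the second argument.
std::string putNumber(long value, std::ios_base::fmtflags flags) {
    std::ostringstream os;
    os.imbue(std::locale::classic());
    os.flags(flags);

    // get numeric output facet of the stream's locale
    const std::num_put<char>& np
        = std::use_facet<std::num_put<char> >(os.getloc());

    // write value to the stream buffer; os serves as the ios_base argument
    np.put(std::ostreambuf_iterator<char>(os), os, ' ', value);
    return os.str();
}
```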
The facet num_get
is used to parse textual representations of numbers. Corresponding to the facet num_put,
it is a template that takes two template arguments: the character type charT
and an input iterator type InIt,
which defaults to istreambuf_iterator<charT>.
It provides a set of get()
functions that differ only in the last argument. You can use the facet as follows:[4]
std::locale loc;                  // locale
InIt beg = ...;                   // begin of input sequence
InIt end = ...;                   // end of input sequence
std::ios_base& fmt = ...;         // stream which defines input format
std::ios_base::iostate err;       // state after call
T value;                          // value after successful call

// get numeric input facet of the loc locale
const std::num_get<charT,InIt>& ng
    = std::use_facet<std::num_get<charT,InIt> >(loc);

// read value with numeric input facet
ng.get(beg, end, fmt, err, value);
These statements attempt to parse a numeric value corresponding to the type T from the sequence of characters between beg
and end.
The format of the expected numeric value is defined by the argument fmt.
If the parsing fails, err
is modified to contain the value ios_base::failbit.
Otherwise, ios_base::goodbit
is stored in err
and the parsed value in value.
The value of value
is modified only if the parsing is successful. get()
returns the second parameter (end)
if the sequence was used completely. Otherwise, it returns an iterator pointing to the first character that could not be parsed as part of the numeric value.
The facet num_get
supports functions to read objects of the types bool, long, unsigned short, unsigned int, unsigned long, float, double, long double, and void*.
There are some types for which there is no corresponding function in the num_put
facet; for example, unsigned short.
This is because writing a value of type unsigned short
produces the same result as writing a value of type unsigned short
promoted to an unsigned long.
However, reading a value as type unsigned long
and then converting it to unsigned short
may yield a different value than reading it as type unsigned short
directly.
The standard requires that the two instantiations num_get<char>
and num_get<wchar_t>
be stored in each locale (both using the default for the second template argument). In addition, the C++ standard library supports all instantiations that take a character type as the first template argument and an input iterator type as the second. As with num_put,
not all supported instantiations are required to be present in all locale objects.
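The following sketch parses a long from a string with num_get<char> and its default input iterator. The function name parseLong() is hypothetical. The sketch tests only failbit because implementations may also set eofbit when the end of the input is reached.

```cpp
#include <ios>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Tries to parse a long at the beginning of text.
// Returns true on success and stores the result in value.
bool parseLong(const std::string& text, long& value) {
    std::istringstream is(text);
    is.imbue(std::locale::classic());

    // get numeric input facet of the stream's locale
    const std::num_get<char>& ng
        = std::use_facet<std::num_get<char> >(is.getloc());

    std::ios_base::iostate err = std::ios_base::goodbit;
    std::istreambuf_iterator<char> beg(is);   // begin of input sequence
    std::istreambuf_iterator<char> end;       // end of input sequence

    // read value with numeric input facet
    ng.get(beg, end, is, err, value);
    return (err & std::ios_base::failbit) == 0;
}
```

Unlike the stream operator >>, the facet does not skip leading whitespace.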
The two facets time_get
and time_put
in the category time
provide services for parsing and formatting times and dates. This is done by the member functions that operate on objects of type tm.
This type is defined in the header file <ctime>.
The objects are not passed directly; rather, a pointer to them is used as the argument.
Both facets in the time
category depend heavily on the behavior of the function strftime()
(also defined in the header file <ctime>
). This function uses a string with conversion specifiers to produce a string from a tm
object. Table 14.9 provides a brief summary of the conversion specifiers. The same conversion specifiers are also used by the time_put
facet.
Of course, the exact string produced by strftime()
depends on the C locale in effect. The examples in the table are given for the "C"
locale.
The facet time_get
is a template that takes a character type charT
and an input iterator type InIt
as template arguments. The input iterator type defaults to istreambuf_iterator<charT>.
Table 14.10 lists the members defined for the time_get
facet. All of these members, except date_order(),
parse the string and store the results in the tm
object pointed to by the argument t.
If the string could not be parsed correctly, either an error is reported (for example, by modifying the argument err
) or an unspecified value is stored. This means that a time produced by a program can be parsed reliably but user input cannot. With the argument fmt,
other facets used during parsing are determined. Whether other flags from fmt
have any influence on the parsing is not specified.
All functions return an iterator that has the position immediately after the last character read. The parsing stops if parsing is complete or if an error occurs (for example, because a string could not be parsed as a date).
A function reading the name of a weekday or a month reads both abbreviated names and full names. If the abbreviation is followed by a letter, which would be legal for a full name, the function attempts to read the full name. If this fails, the parsing fails, even though an abbreviated name was already parsed successfully.
Table 14.9. Conversion Specifiers for strftime()
Specifier | Meaning | Example |
---|---|---|
%a | Abbreviated weekday | Mon |
%A | Full weekday | Monday |
%b | Abbreviated month name | Jul |
%B | Full month name | July |
%c | Locale's preferred date and time representation | Jul 12 21:53:22 1998 |
%d | Day of the month | 12 |
%H | Hour of the day using a 24-hour clock | 21 |
%I | Hour of the day using a 12-hour clock | 9 |
%j | Day of the year | 193 |
%m | Month as decimal number | 7 |
%M | Minutes | 53 |
%p | Morning or evening (am or pm) | pm |
%S | Seconds | 22 |
%U | Week number starting with the first Sunday | 28 |
%W | Week number starting with the first Monday | 28 |
%w | Weekday as a number (Sunday == 0) | 0 |
%x | Locale's preferred date representation | Jul 12 1998 |
%X | Locale's preferred time representation | 21:53:22 |
%y | The year without the century | 98 |
%Y | The year with the century | 1998 |
%Z | The time zone | MEST |
%% | The literal % | % |
Whether a function that is parsing a year allows two-digit years is unspecified. The year that is assumed for a two-digit year, if it is allowed, is also unspecified.
date_order()
returns the order in which the day, month, and year appear in a date string. This is necessary for some dates because the order cannot be determined from the string representing a date. For example, the first day in February in the year 2003 may be printed either as 3/2/1
or as 1/2/3.
Class time_base,
which is the base class of the facet time_get,
defines an enumeration called dateorder
for possible date order values. Table 14.11 lists these values.
The standard requires that the two instantiations time_get<char>
and time_get<wchar_t>
are stored in each locale. In addition, the C++ standard library supports all instantiations that take char
or wchar_t
as the first template argument and a corresponding input iterator as the second. Not all of these instantiations are required to be stored in each locale object.
Table 14.10. Members of the time_get
Facet
Expression | Meaning |
---|---|
tg.get_time(beg, end, fmt, err, t) | Parses the string between beg and end as the time produced by the %X specifier for strftime() |
tg.get_date(beg, end, fmt, err, t) | Parses the string between beg and end as the date produced by the %x specifier for strftime() |
tg.get_weekday(beg, end, fmt, err, t) | Parses the string between beg and end as the name of the weekday |
tg.get_monthname(beg, end, fmt, err, t) | Parses the string between beg and end as the name of the month |
tg.get_year(beg, end, fmt, err, t) | Parses the string between beg and end as the year |
tg.date_order() | Returns the date order used by the facet |
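As a sketch of how these members are called, the following function parses the name of a weekday with the time_get<char> facet of the classic locale, which uses the English names. The function name parseWeekday() is hypothetical.

```cpp
#include <ctime>
#include <ios>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Parses text as the name of a weekday.
// Returns the weekday as a number (Sunday == 0) or -1 if parsing fails.
int parseWeekday(const std::string& text) {
    std::istringstream is(text);
    is.imbue(std::locale::classic());

    // get time input facet of the stream's locale
    const std::time_get<char>& tg
        = std::use_facet<std::time_get<char> >(is.getloc());

    std::ios_base::iostate err = std::ios_base::goodbit;
    std::tm t = std::tm();                    // all members initialized with zero
    std::istreambuf_iterator<char> beg(is);   // begin of input sequence
    std::istreambuf_iterator<char> end;       // end of input sequence

    // the result is stored in the tm object to which the last argument points
    tg.get_weekday(beg, end, is, err, &t);
    if (err & std::ios_base::failbit) {
        return -1;
    }
    return t.tm_wday;
}
```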
The facet time_put
is used for formatting times and dates. It is a template that takes as arguments a character type charT
and an optional output iterator type OutIt.
The latter defaults to type ostreambuf_iterator
(see page 665).
The facet time_put
defines two functions called put(),
which are used to convert the date information stored in an object of type tm
into a sequence of characters written to an output iterator. Table 14.12 lists the members of the facet time_put.
Table 14.12. Members of the time_put
Facet
Expression | Meaning |
---|---|
tp.put(to, fmt, fill, t, cbeg, cend) | Converts according to the string [cbeg,cend) |
tp.put(to, fmt, fill, t, cvt, mod) | Converts using the conversion specifier cvt |
Both functions write their results to the output iterator to
and return an iterator pointing immediately after the last character produced. The argument fmt is of type ios_base
and is used to access other facets and potentially additional formatting information. The character fill
is used when a space character is needed and for filling. The argument t
points to an object of type tm
that is storing the date to be formatted.
The version of put()
that takes two characters as the last two arguments formats the date found in the tm
object to which t
refers, interpreting the argument cvt
like a conversion specifier to strftime().
This put()
function does only one conversion; namely, the one specified by the cvt
character. This function is called by the other put()
function for each conversion specifier found. For example, using 'X'
as the conversion specifier results in the time that is stored in *t
being written to the output iterator. The meaning of the argument mod
is not defined by the standard. It is intended to be used as a modifier to the conversion as found in several implementations of the strftime()
function.
The version of put()
that takes a string defined by the range [cbeg,cend) to guide the conversion behaves very much like strftime().
It scans the string and writes every character that is not part of a conversion specification to the output iterator to.
If it encounters a conversion specification introduced by the character %,
it extracts an optional modifier and a conversion specifier. The function continues by calling the other version of put(),
using the conversion specifier and the modifier as the last two arguments. After processing a conversion specification, put()
continues to scan the string.
Note that this facet is somewhat unusual because it provides a nonvirtual member function; namely, the function put(),
which uses a string as the conversion specification. This function cannot be overridden in classes derived from time_put.
Only the other put()
function can be overridden.
The standard requires that the two instantiations time_put<char>
and time_put<wchar_t>
are stored in each locale. In addition, the C++ standard library supports all instantiations that take char
or wchar_t
as the first template argument and a corresponding output iterator as the second. There is no guaranteed support for instantiations using a type other than char
or wchar_t
as the first template argument. Also, it is not guaranteed that any instantiations other than time_put<char>
and time_put<wchar_t>
be stored in locale objects by default.
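The version of put() that takes a format string can be used as in the following sketch. The function names makeDate() and formatDate() are hypothetical, and only specifiers that produce numbers are used in the usage shown below, so that the result does not depend on the names of weekdays or months.

```cpp
#include <ctime>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Creates a tm object for the passed date (month from 1 to 12).
std::tm makeDate(int year, int month, int day) {
    std::tm t = std::tm();       // all members initialized with zero
    t.tm_year = year - 1900;     // tm stores the years since 1900
    t.tm_mon = month - 1;        // tm stores the months since January
    t.tm_mday = day;
    return t;
}

// Formats t according to pattern, which uses the strftime() conversion specifiers.
std::string formatDate(const std::tm& t, const std::string& pattern) {
    std::ostringstream os;
    os.imbue(std::locale::classic());

    // get time output facet of the stream's locale
    const std::time_put<char>& tp
        = std::use_facet<std::time_put<char> >(os.getloc());

    // convert according to the character range [cbeg,cend)
    const char* cbeg = pattern.data();
    const char* cend = cbeg + pattern.size();
    tp.put(std::ostreambuf_iterator<char>(os), os, ' ', &t, cbeg, cend);
    return os.str();
}
```

For example, formatDate(makeDate(1998, 7, 12), "%Y-%m-%d") formats the date used in Table 14.9 in year, month, day order.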
The category monetary
consists of the facets moneypunct, money_get,
and money_put.
The facet moneypunct
defines the format of monetary values. The other two use this information to format or to parse a monetary value.
Monetary values are printed differently depending on the context. The formats used in different cultural communities differ widely. Examples of the varying details are the placement of the currency symbol (if present at all), the notation for negative or positive values, the use of national or international currency symbols, and the use of thousands separators. To provide the necessary flexibility, the details of the format are factored into the facet moneypunct.
The facet moneypunct
is a template that takes as arguments a character type charT
and a Boolean value that defaults to false.
The Boolean value indicates whether local (false
) or international (true
) currency symbols are to be used. Table 14.13 lists the members of the facet moneypunct.
Table 14.13. Members of the moneypunct
Facet
Expression | Meaning |
---|---|
mp.decimal_point() | Returns a character to be used as the decimal point |
mp.thousands_sep() | Returns a character to be used as the thousands separator |
mp.grouping() | Returns a string specifying the placement of the thousands separators |
mp.curr_symbol() | Returns a string with the currency symbol |
mp.positive_sign() | Returns a string with the positive sign |
mp.negative_sign() | Returns a string with the negative sign |
mp.frac_digits() | Returns the number of fractional digits |
mp.pos_format() | Returns the format to be used for non-negative values |
mp.neg_format() | Returns the format to be used for negative values |
moneypunct
derives from the class money_base.
This base class defines an enumeration called part,
which is used to form a pattern for monetary values. The class also defines a type called pattern
(which is actually a type definition for char [4]
). This type is used to store four values of type part
that form a pattern describing the layout of a monetary value. Table 14.14 lists the five possible parts
that can be placed in a pattern.
Table 14.14. Parts of Monetary Layout Patterns
Value | Meaning |
---|---|
none | At this position, spaces may appear but are not required |
space | At this position, at least one space is required |
sign | At this position, a sign may appear |
symbol | At this position, the currency symbol may appear |
value | At this position, the value appears |
moneypunct
defines two functions that return patterns: the function neg_format()
for negative values and the function pos_format()
for non-negative values. In a pattern, each of the parts sign, symbol,
and value
is mandatory, and one of the parts none
and space
has to appear. This does not mean, however, that there is really a sign or a currency symbol printed. What is printed at the positions indicated by the parts depends on the values returned from other members of the facet and on the formatting flags passed to the functions for formatting.
Only the value always appears. Of course, it is placed at the position where the part value appears in the pattern. The value has exactly frac_digits()
fractional digits, with decimal_point()
used as the decimal point (unless there are no fractional digits, in which case no decimal point is used).
When monetary values are read, thousands separators are allowed but not required in the input. If present, they are checked for correct placement according to grouping(). If grouping() is empty, no thousands separators are allowed. The correct placement of thousands separators is checked after all other parsing is successful. The character used for the thousands separator is the one returned from thousands_sep(). The rules for the placement of the thousands separators are identical to the rules for numeric formatting (see page 705). When monetary values are printed, thousands separators are always inserted according to the string returned from grouping().
The parts space
and none
control the placement of spaces. space
is used at a position where at least one space is required. During formatting, if ios_base::internal
is specified in the format flags, fill characters are inserted at the position of the space
or the none
part. Of course, filling is done only if the specified minimum width is not already used up by other characters. The character used as the space character is passed as the argument to the functions for the formatting of monetary values. If the formatted value does not contain a space, none
can be placed at the last position. space
and none
may not appear as the first part in a pattern, and space
may not be the last part in a pattern.
Signs for monetary values may consist of more than one character. For example, in certain contexts parentheses around a value are used to indicate negative values. At the position where the sign
part appears in the pattern, the first character of the sign appears. All other characters of the sign appear at the end after all other components. If the string for a sign is empty, no character indicating the sign appears. The character that is to be used as a sign is determined with the function positive_sign()
for non-negative values and negative_sign()
for negative values.
At the position of the symbol
part, the currency symbol appears. The symbol is present only if the formatting flags used during formatting or parsing have the ios_base::showbase
flag set. The string returned from the function curr_symbol()
is used as the currency symbol. The currency symbol is a local symbol to be used to indicate the currency if the second template argument is false
(the default). Otherwise, an international currency symbol is used.
Table 14.15 illustrates all of this, using the value $-1234.56
as an example. Of course, this means that frac_digits()
returns 2.
In addition, a width of 0
is always used.
The standard requires that the instantiations moneypunct<char>, moneypunct<wchar_t>, moneypunct<char, true>,
and moneypunct<wchar_t, true>
are stored in each locale. The C++ standard library does not support any other instantiation.
The facet money_put
is used to format monetary values. It is a template that takes a character type charT
as the first template argument and an output iterator OutIt
as the second. The output iterator defaults to ostreambuf_iterator<charT>.
The two member functions put()
produce a sequence of characters corresponding to the format specified by a moneypunct
facet. The value to be formatted is either passed as type long double or as type basic_string<charT>.
Table 14.15. Examples of Using the Monetary Pattern
Pattern | Sign | Result |
---|---|---|
symbol none sign value | | $1234.56 |
symbol none sign value | - | $-1234.56 |
symbol space sign value | - | $ -1234.56 |
symbol space sign value | () | $ (1234.56) |
sign symbol space value | () | ($ 1234.56) |
sign value space symbol | () | (1234.56 $) |
symbol space value sign | - | $ 1234.56- |
sign value space symbol | - | -1234.56 $ |
sign value none symbol | - | -1234.56$ |
You can use the facet as follows:
    // get monetary output facet of the loc locale
    const std::money_put<charT,OutIt>& mp
        = std::use_facet<std::money_put<charT,OutIt> >(loc);
    // write value with monetary output facet
    mp.put(to, intl, fmt, fill, value);
The argument to
is an output iterator of type OutIt
to which the formatted string is written. put()
returns an object of this type pointing immediately after the last character produced. The argument intl
indicates whether a local or an international currency symbol is to be used. fmt
is used to determine formatting flags, such as the width to be used and the moneypunct
facet defining the format of the value to be printed. Where a space character has to appear, the character fill
is inserted.
The argument value
has type long double
or type basic_string<charT>.
This is the value that is formatted. If the argument is a string, this string may consist only of decimal digits with an optional leading minus sign. If the first character of the string is a minus sign, the value is formatted as a negative value. After it is determined that the value is negative, the minus sign is discarded. The number of fractional digits in the string is determined from the member function frac_digits()
of the moneypunct
facet.
The standard requires that the two instantiations money_put<char>
and money_put<wchar_t>
are stored in each locale. In addition, the C++ standard library supports all instantiations that take char
or wchar_t
as the first template argument and a corresponding output iterator as the second. However, these additional instantiations are not required to be stored in each locale object.
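The following is a minimal sketch of how money_put and moneypunct work together. The facet DemoMoneyPunct and the function formatMoney() are hypothetical names introduced only for this illustration. The facet uses the pattern "sign symbol space value" with parentheses as the negative sign, and it is installed in a copy of the classic locale, so the example does not depend on any named locale being available:

```cpp
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for this sketch: "$" as currency symbol, parentheses
// as negative sign, two fractional digits, pattern "sign symbol space value".
class DemoMoneyPunct : public std::moneypunct<char> {
protected:
    virtual char do_decimal_point() const { return '.'; }
    virtual std::string do_grouping() const { return ""; }   // no grouping
    virtual std::string do_curr_symbol() const { return "$"; }
    virtual std::string do_positive_sign() const { return ""; }
    virtual std::string do_negative_sign() const { return "()"; }
    virtual int do_frac_digits() const { return 2; }
    virtual pattern do_pos_format() const { return makePattern(); }
    virtual pattern do_neg_format() const { return makePattern(); }
private:
    static pattern makePattern() {
        pattern p = { { sign, symbol, space, value } };
        return p;
    }
};

// Formats a monetary value. Value is either long double or std::string,
// which selects the corresponding overload of money_put::put().
// The value is given in the smallest currency unit (here: cents).
template <typename Value>
std::string formatMoney(const Value& value) {
    std::ostringstream os;
    os.imbue(std::locale(std::locale::classic(), new DemoMoneyPunct));
    os.setf(std::ios_base::showbase);   // without showbase the symbol is omitted
    const std::money_put<char>& mp =
        std::use_facet<std::money_put<char> >(os.getloc());
    mp.put(std::ostreambuf_iterator<char>(os), false, os, ' ', value);
    return os.str();
}
```

Calling formatMoney(-123456.0L) or formatMoney(std::string("-123456")) formats the value -1234.56. The first character of the sign appears at the position of the sign part, and the rest of the sign is appended after all other components.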
The facet money_get
is used for parsing of monetary values. It is a template class that takes a character type charT
as the first template argument and an input iterator type InIt
as the second. The second template argument defaults to istreambuf_iterator<charT>.
This class defines two member functions called get()
that try to parse a character sequence and, if the parse is successful, store the result in a value of type long double
or of type basic_string<charT>.
You can use the facet as follows:
    // get monetary input facet of the loc locale
    const std::money_get<charT,InIt>& mg
        = std::use_facet<std::money_get<charT,InIt> >(loc);
    // read value with monetary input facet
    mg.get(beg, end, intl, fmt, err, val);
The character sequence to be parsed is defined by the sequence between beg
and end.
The parsing stops as soon as either all elements of the used pattern are read or an error is encountered. If an error is encountered, the ios_base::failbit
is set in err
and nothing is stored in val.
If parsing is successful, the result is stored in the value of type long double
or basic_string
that is passed by reference as argument val.
The argument intl
is a Boolean value that selects a local or an international currency string. The moneypunct
facet defining the format of the value to be parsed is retrieved using the locale object imbued by the argument fmt.
For parsing a monetary value, the pattern returned from the member neg_format()
of the moneypunct
facet is always used.
At the position of none
or space,
the function that is parsing a monetary value consumes all available space, unless none
is the last part in a pattern. Trailing spaces are not skipped. The get()
functions return an iterator that points after the last character that was consumed.
The standard requires that the two instantiations money_get<char>
and money_get<wchar_t>
be stored in each locale. In addition, the C++ standard library supports all instantiations that take char
or wchar_t
as the first template argument and a corresponding input iterator as the second. However, these additional instantiations are not required to be stored in each locale object.
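As a sketch of the parsing direction, the following code reads a monetary value with money_get. The facet DollarPunct and the function parseCents() are hypothetical names used only for this illustration. The facet uses the pattern "symbol sign none value", and showbase is set so that the currency symbol is required in the input:

```cpp
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for this sketch: pattern "symbol sign none value",
// "$" as currency symbol, "-" as negative sign, two fractional digits.
class DollarPunct : public std::moneypunct<char> {
protected:
    virtual char do_decimal_point() const { return '.'; }
    virtual std::string do_grouping() const { return ""; }   // no grouping
    virtual std::string do_curr_symbol() const { return "$"; }
    virtual std::string do_positive_sign() const { return ""; }
    virtual std::string do_negative_sign() const { return "-"; }
    virtual int do_frac_digits() const { return 2; }
    virtual pattern do_pos_format() const { return makePattern(); }
    virtual pattern do_neg_format() const { return makePattern(); }
private:
    static pattern makePattern() {
        pattern p = { { symbol, sign, none, value } };
        return p;
    }
};

// Parses text such as "$-1234.56". On success, stores the value in the
// smallest currency unit (cents) and returns true. On failure, cents is
// left untouched and false is returned.
bool parseCents(const std::string& text, long double& cents) {
    std::istringstream is(text);
    is.imbue(std::locale(std::locale::classic(), new DollarPunct));
    is.setf(std::ios_base::showbase);   // require the currency symbol
    const std::money_get<char>& mg =
        std::use_facet<std::money_get<char> >(is.getloc());
    std::ios_base::iostate err = std::ios_base::goodbit;
    long double val = 0;
    mg.get(std::istreambuf_iterator<char>(is),
           std::istreambuf_iterator<char>(),
           false, is, err, val);
    if (err & std::ios_base::failbit) {
        return false;
    }
    cents = val;
    return true;
}
```

Note that err may have eofbit set after a successful parse that consumed the whole input, which is why only failbit is tested.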
The C++ standard library defines two facets to deal with characters: ctype
and codecvt.
Both belong to the category locale::ctype.
The facet ctype
is used mainly for character classification, such as testing whether a character is a letter. It also provides methods for conversion between lowercase and uppercase letters and for conversion between char
and the character type for which the facet is instantiated. The facet codecvt
is used to convert characters between different encodings and is used mainly by basic_filebuf
to convert between external and internal representations.
The facet ctype
is a template class parameterized with a character type. Three kinds of functions are provided by the class ctype<charT>:
Functions to convert between char
and charT
Functions for character classification
Functions for conversion between uppercase and lowercase letters
Table 14.16 lists the members defined for the facet ctype.
Table 14.16. Services Defined by the ctype<charT> Facet
Expression | Effect |
---|---|
ct.is(m,c) | Tests whether the character c matches the mask m |
ct.is(beg,end,vec) | For each character in the range between beg and end, places a mask matched by the character in the corresponding location of vec |
ct.scan_is(m,beg,end) | Returns a pointer to the first character in the range between beg and end that matches the mask m or end if there is no such character |
ct.scan_not(m,beg,end) | Returns a pointer to the first character in the range between beg and end that does not match the mask m or end if all characters match the mask |
ct.toupper(c) | Returns an uppercase letter corresponding to c if there is such a letter; otherwise c is returned |
ct.toupper(beg,end) | Converts each letter in the range between beg and end by replacing the letter with the result of toupper() |
ct.tolower(c) | Returns a lowercase letter corresponding to c if there is such a letter; otherwise c is returned |
ct.tolower(beg,end) | Converts each letter in the range between beg and end by replacing the letter with the result of tolower() |
ct.widen(c) | Returns the char c converted to charT |
ct.widen(beg,end,dest) | For each character in the range between beg and end, places the result of widen() at the corresponding location in dest |
ct.narrow(c,default) | Returns the charT c converted to char, or the char default if there is no suitable character |
ct.narrow(beg,end,default,dest) | For each character in the range between beg and end, places the result of narrow() at the corresponding location in dest |
The function is(beg,end, vec) is used to store a set of masks in an array. For each of the characters in the range between beg and end, a mask with the attributes corresponding to the character is stored in the array pointed to by vec. This is useful to avoid virtual function calls for the classification of characters if there are lots of characters to be classified.
The function widen()
can be used to convert a character of type char
from the native character set to the corresponding character in the character set used by a locale. Thus, it makes sense to widen a character even if the result is also of type char.
For the opposite direction, the function narrow()
can be used to convert a character from the character set used by the locale to a corresponding char
in the native character set, provided there is such a char.
For example, the following code converts the decimal digits from char
to wchar_t:
    std::locale loc;
    char narrow[] = "0123456789";
    wchar_t wide[10];
    std::use_facet<std::ctype<wchar_t> >(loc).widen(narrow, narrow+10, wide);
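The same technique can be wrapped into helpers that convert whole strings in both directions. This is only a sketch. The names widenString() and narrowString() are not part of the library, and the classic locale is used as the default so that the result does not depend on the environment:

```cpp
#include <locale>
#include <string>

// Converts each char of s to wchar_t with the ctype<wchar_t> facet of loc.
std::wstring widenString(const std::string& s,
                         const std::locale& loc = std::locale::classic()) {
    const std::ctype<wchar_t>& ct = std::use_facet<std::ctype<wchar_t> >(loc);
    std::wstring result(s.size(), L'\0');
    if (!s.empty()) {
        ct.widen(s.data(), s.data() + s.size(), &result[0]);
    }
    return result;
}

// Converts each wchar_t of ws to char. Characters that have no
// corresponding char are replaced by dflt.
std::string narrowString(const std::wstring& ws, char dflt = '?',
                         const std::locale& loc = std::locale::classic()) {
    const std::ctype<wchar_t>& ct = std::use_facet<std::ctype<wchar_t> >(loc);
    std::string result(ws.size(), '\0');
    if (!ws.empty()) {
        ct.narrow(ws.data(), ws.data() + ws.size(), dflt, &result[0]);
    }
    return result;
}
```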
Class ctype
derives from the class ctype_base.
This class is used only to define an enumeration called mask.
This enumeration defines values that can be combined to form a bitmask used for testing character properties. The values defined in ctype_base
are shown in Table 14.17. The functions for character classification all take a bitmask as an argument, which is formed by combinations of the values defined in ctype_base.
To create bitmasks as needed, you can use the operators for bit manipulation (|, &, ^, and ~). A character matches this mask if it is any of the characters identified by the mask.
Table 14.17. Character Mask Values Used by ctype
Value | Meaning |
---|---|
ctype_base::alnum | Tests for letters and digits (equivalent to alpha | digit) |
ctype_base::alpha | Tests for letters |
ctype_base::cntrl | Tests for control characters |
ctype_base::digit | Tests for decimal digits |
ctype_base::graph | Tests for punctuation characters, letters, and digits (equivalent to alnum | punct) |
ctype_base::lower | Tests for lowercase letters |
ctype_base::print | Tests for printable characters |
ctype_base::punct | Tests for punctuation characters |
ctype_base::space | Tests for space characters |
ctype_base::upper | Tests for uppercase letters |
ctype_base::xdigit | Tests for hexadecimal digits |
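As a small sketch of how combined masks can be used, the following helpers count and search characters with the ctype<char> facet of the classic locale. The function names countMatching() and findFirst() are made up for this example:

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Counts the characters of s that match at least one property in mask m.
std::size_t countMatching(const std::string& s, std::ctype_base::mask m) {
    const std::ctype<char>& ct =
        std::use_facet<std::ctype<char> >(std::locale::classic());
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (ct.is(m, s[i])) {
            ++n;
        }
    }
    return n;
}

// Returns the index of the first character of s that matches mask m,
// or s.size() if there is no such character.
std::size_t findFirst(const std::string& s, std::ctype_base::mask m) {
    const std::ctype<char>& ct =
        std::use_facet<std::ctype<char> >(std::locale::classic());
    const char* beg = s.data();
    const char* end = beg + s.size();
    return static_cast<std::size_t>(ct.scan_is(m, beg, end) - beg);
}
```

For example, countMatching(s, std::ctype_base::alpha | std::ctype_base::digit) counts all letters and digits in s.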
For better performance of the character classification functions, the facet ctype
is specialized for the character type char.
This specialization does not delegate the functions dealing with character classification (is(), scan_is(), and scan_not()) to corresponding virtual functions. Instead, these functions are implemented inline using a table lookup. For this case additional members are provided (Table 14.18).
Table 14.18. Additional Members of ctype<char>
Expression | Effect |
---|---|
ctype<char>::table_size | Returns the size of the table (>=256) |
ctype<char>::classic_table() | Returns the table for the "classic" C locale |
ctype<char>(table,del=false) | Creates the facet with table table |
ct.table() | Returns the current table of facet ct |
Manipulating the behavior of these functions for specific locales is done with a corresponding table of masks that is passed as a constructor argument:
    // create and initialize the table
    std::ctype_base::mask mytable[std::ctype<char>::table_size] = {
        ...
    };
    // use the table for the ctype<char> facet ct
    std::ctype<char> ct(mytable, false);
This code constructs a ctype<char>
facet that uses the table mytable
to determine the character class of a character. More precisely, the character class of the character c is determined by
mytable[static_cast<unsigned char>(c)]
The static member table_size
is a constant defined by the library implementation and gives the size of the lookup table. This size is at least 256 characters. The second optional argument to the constructor of ctype<char>
indicates whether the table should be deleted if the facet is destroyed. If it is true,
the table passed to the constructor is released by using delete []
when the facet is no longer needed.
The member function table()
is a protected member function that returns the table that is passed as the first argument to the constructor. The static protected member function classic_table()
returns the table that is used for character classification in the classic C
locale.
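A common application of a user-defined table is to change which characters are treated as whitespace during input. The following sketch derives a facet from ctype<char> that copies the classic table and additionally classifies the comma as a space character. The class CommaIsSpace and the function splitOnCommas() are hypothetical names for this illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet: classifies like the classic C locale,
// except that ',' is treated as a space character.
class CommaIsSpace : public std::ctype<char> {
public:
    explicit CommaIsSpace(std::size_t refs = 0)
        : std::ctype<char>(table_, false, refs) {   // false: the facet must not delete table_
        // classic_table() is protected, so it is accessible here
        std::copy(classic_table(), classic_table() + table_size, table_);
        table_[static_cast<unsigned char>(',')] = space;
    }
private:
    mask table_[table_size];
};

// Splits line into words that are separated by whitespace or commas.
std::vector<std::string> splitOnCommas(const std::string& line) {
    std::istringstream is(line);
    is.imbue(std::locale(std::locale::classic(), new CommaIsSpace));
    std::vector<std::string> words;
    std::string word;
    while (is >> word) {        // operator >> skips "space" characters of the imbued locale
        words.push_back(word);
    }
    return words;
}
```

Because the table is a member of the facet, the second constructor argument is false and the table lives exactly as long as the facet itself.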
Convenient use of the ctype
facets is provided by predefined global functions. Table 14.19 lists all of the global functions.
Table 14.19. Global Convenience Functions for Character Classification
Function | Effect |
---|---|
isalnum(c,loc) | Returns whether c is a letter or a digit (equivalent to isalpha()||isdigit()) |
isalpha(c,loc) | Returns whether c is a letter |
iscntrl(c,loc) | Returns whether c is a control character |
isdigit(c,loc) | Returns whether c is a digit |
isgraph(c,loc) | Returns whether c is a printable, nonspace character (equivalent to isalnum()||ispunct()) |
islower(c,loc) | Returns whether c is a lowercase letter |
isprint(c,loc) | Returns whether c is a printable character (including whitespaces) |
ispunct(c,loc) | Returns whether c is a punctuation character (that is, it is printable, but it is not a space, digit, or letter) |
isspace(c,loc) | Returns whether c is a space character |
isupper(c,loc) | Returns whether c is an uppercase letter |
isxdigit(c,loc) | Returns whether c is a hexadecimal digit |
tolower(c,loc) | Converts c from an uppercase letter to a lowercase letter |
toupper(c,loc) | Converts c from a lowercase letter to an uppercase letter |
For example, the following expression determines whether the character c is a lowercase letter in the locale loc:
std::islower(c,loc)
It returns a corresponding value of type bool.
The following expression returns the character c
converted to an uppercase letter, if c
is a lowercase letter in the locale loc:
std::toupper(c,loc)
If c
is not a lowercase letter, the first argument is returned unmodified.
The expression
std::islower(c,loc)
is equivalent to the following expression:
std::use_facet<std::ctype<char> >(loc).is(std::ctype_base::lower,c)
This expression calls the member function is()
of the facet ctype<char>. is()
determines whether the character c
fulfills any of the character properties that are passed as the bitmask in the first argument. The values for the bitmask are defined in the class ctype_base.
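The conversion functions forward to the facet in the same way: std::toupper(c,loc) calls the member toupper() of the ctype facet of loc. The following sketch shows both forms side by side. The function names are invented for this example:

```cpp
#include <locale>
#include <string>

// Uses the global convenience function for each character.
// The facet is looked up in loc on every call.
std::string upperViaConvenience(std::string s, const std::locale& loc) {
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        s[i] = std::toupper(s[i], loc);
    }
    return s;
}

// Obtains the facet once and converts the whole range in one call.
std::string upperViaFacet(std::string s, const std::locale& loc) {
    const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
    if (!s.empty()) {
        ct.toupper(&s[0], &s[0] + s.size());
    }
    return s;
}
```

Both functions yield the same result for the same locale. The second form avoids the repeated facet lookup.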
See page 502 and page 669 for examples of the use of these convenience functions.
The global convenience functions for character classification correspond to C functions that have the same name but only the first argument. They are defined in <cctype>
and <ctype.h>,
and always use the current global C locale.[4] Their use is even more convenient:
if (std::isdigit(c)) { ... }
However, by using them you can't use different locales in the same program. Also, you can't use a user-defined ctype
facet using the C function. See page 497 for an example that demonstrates how to use these C functions to convert all characters of a string to uppercase letters.
It is important to note that the C++ convenience functions should not be used in code sections where performance is crucial. It is much faster to obtain the corresponding facet from the locale and use the functions on this object directly. If a lot of characters are to be classified according to the same locale, this can be improved even more, at least for non-char characters. The function is(beg,end,vec) can be used to determine the masks for typical characters: This function determines for each character in the range [beg,end) a mask that describes the properties of the character. The resulting mask is stored in vec at the position corresponding to the character's position. This vector can then be used for fast lookup of the characters.
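A sketch of this bulk classification for wide characters might look as follows. The names classify() and countDigits() are introduced only for this example, and the classic locale is the assumed default:

```cpp
#include <cstddef>
#include <locale>
#include <string>
#include <vector>

// Determines the mask of every character in text with a single call.
std::vector<std::ctype_base::mask>
classify(const std::wstring& text,
         const std::locale& loc = std::locale::classic()) {
    const std::ctype<wchar_t>& ct = std::use_facet<std::ctype<wchar_t> >(loc);
    std::vector<std::ctype_base::mask> masks(text.size());
    if (!text.empty()) {
        ct.is(text.data(), text.data() + text.size(), &masks[0]);
    }
    return masks;
}

// Evaluates the stored masks without any further facet calls.
std::size_t countDigits(const std::vector<std::ctype_base::mask>& masks) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < masks.size(); ++i) {
        if ((masks[i] & std::ctype_base::digit) != 0) {
            ++n;
        }
    }
    return n;
}
```

After the masks are computed, any number of property tests can be made with plain bit operations.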
The facet codecvt
is used to convert between internal and external character encoding. For example, it can be used to convert between Unicode and EUC (Extended UNIX Code), provided the implementation of the C++ standard library supports a corresponding facet.
This facet is used by the class basic_filebuf
to convert between the internal representation and the representation stored in a file. The class basic_filebuf<charT,traits>
(see page 627) uses the instantiation codecvt<charT,char,typename traits::state_type>
to do so. The facet used is taken from the locale stored with basic_filebuf.
This is the major application of the codecvt
facet. Only rarely is it necessary to use this facet directly.
In Section 14.1, some basics of character encodings are introduced. To understand codecvt,
you need to know that there are two approaches for the encoding of characters: One is character encodings that use a fixed number of bytes for each character (wide-character representation), and the other is character encodings that use a varying number of bytes per character (multibyte representation).
It is also necessary to know that multibyte representations use so-called shift states for space efficient representation of characters. The correct interpretation of a byte is possible only with the correct shift state at this position. This in turn can be determined only by walking through the whole sequence of multibyte characters (see Section 14.1, for more details).
The codecvt<>
facet takes three template arguments:
The character type internT
used for an internal representation
The type externT
used to represent an external representation
The type stateT
used to represent an intermediate state during the conversion
The intermediate state may consist of incomplete wide characters or the current shift state. The C++
standard makes no restriction about what is stored in the objects representing the state.
The internal representation always uses a representation with a fixed number of bytes per character. Mainly the two types char
and wchar_t
are intended to be used within a program. The external representation may be a representation that uses a fixed size or a multibyte representation. When a multibyte representation is used, the second template argument is the type used to represent the basic units of the multibyte encoding. Each multibyte character is stored in one or more objects of this type. Normally, the type char
is used for this.
The third argument is the type used to represent the current state of the conversion. It is necessary, for example, if one of the character encodings is a multibyte encoding. In this case, the processing of a multibyte character might be terminated because the source buffer is drained or the destination buffer is full while one character is being processed. If this happens, the current state of the conversion is stored in an object of this type.
Similar to the other facets, the standard requires support for only very few conversions. Only the following two instantiations are supported by the C++
standard library:
codecvt<char,char,mbstate_t>,
which converts the native character set to itself (this is actually a degenerated version of the codecvt
facet)
codecvt<wchar_t,char,mbstate_t>,
which converts between the native tiny character set (that is, char) and the native wide-character set (that is, wchar_t)
The C++ standard does not specify the exact semantics of the second conversion. One straightforward approach is to split each wchar_t into sizeof(wchar_t) objects of type char for the conversion from wchar_t to char, and to assemble a wchar_t from the same number of chars when converting in the opposite direction. An implementation may instead convert between wchar_t and the multibyte encoding of the native character set, as the C functions mbrtowc() and wcrtomb() do. Either way, note that this conversion is very different from the conversion between char and wchar_t done by the widen() and narrow() member functions of the ctype facet: While the codecvt functions may use the bits of multiple chars to form one wchar_t (or vice versa), the ctype functions convert a single character in one encoding to the corresponding character in another encoding (if there is such a character).
Like the ctype
facet, codecvt
derives from a base class used to define an enumeration type. This class is named codecvt_base,
and it defines an enumeration called result.
The values of this enumeration are used to indicate the results of codecvt's
members. The exact meanings of the values depend on the member function used. Table 14.20 lists the member functions of the codecvt
facet.
Table 14.20. Members of the codecvt Facet
Expression | Meaning |
---|---|
cvt.in(s,fb,fe,fn,tb,te,tn) | Converts external representation to internal representation |
cvt.out(s,fb,fe,fn,tb,te,tn) | Converts internal representation to external representation |
cvt.unshift(s,tb,te,tn) | Writes escape sequence to switch to initial shift state |
cvt.encoding() | Returns information about the external encoding |
cvt.always_noconv() | Returns true if no conversion will ever be done |
cvt.length(s,fb,fe,max) | Returns the number of externTs from the sequence between fb and fe to produce max internal characters |
cvt.max_length() | Returns the maximum number of externTs necessary to produce one internT |
The function in() converts an external representation to an internal representation. The argument s is a reference to a stateT. At the beginning, this argument represents the shift state used when the conversion is started. At the end, the final shift state is stored there. The shift state passed in can differ from the initial state if the input buffer to be converted is not the first buffer being converted. The arguments fb (from begin) and fe (from end) are of type const externT* and represent the beginning and the end of the input buffer. The arguments tb (to begin) and te (to end) are of type internT* and represent the beginning and the end of the output buffer. The arguments fn (from next, of type const externT*&) and tn (to next, of type internT*&) are references used to return the end of the sequence converted in the input buffer and the output buffer respectively. Either buffer may reach the end before the other buffer reaches the end. The function returns a value of type codecvt_base::result, as indicated in Table 14.21.
Table 14.21. Return Values of the Conversion Functions
Value | Meaning |
---|---|
ok | All source characters were converted successfully |
partial | Not all source characters were converted, or more characters are needed to produce a destination character |
error | A source character was encountered that cannot be converted |
noconv | No conversion was necessary |
If ok
is returned the function made some progress. If fn == fe
holds, this means that the whole input buffer was processed and the sequence between tb
and tn
contains the result of the conversion. The characters in this sequence represent the characters from the input sequence, potentially with a finished character from a previous conversion. If the argument s
passed to in()
was not the initial state, a partial character from a previous conversion that was not completed could have been stored there.
If partial
is returned, either the output buffer was full before the input buffer could be drained or the input buffer was drained when a character was not yet complete (for example, because the last byte in the input sequence was part of an escape sequence switching between shift states). If fe == fn,
the input buffer was drained. In this case, the sequence between tb
and tn
contains all characters that were converted completely but the input sequence terminated with a partially converted character. The necessary information to complete this character's conversion during a subsequent conversion is stored in the shift state s. If fe != fn,
the input buffer was not completely drained. In this case, te == tn
holds; thus, the output buffer is full. The next time the conversion is continued, it should start with fn.
The return value noconv
indicates a special situation. That is, no conversion was necessary to convert the external representation to the internal representation. In this case, fn
is set to fb
and tn
is set to tb.
Nothing is stored in the destination sequence because everything is already stored in the input sequence.
If error
is returned, that means a source character that could not be converted was encountered. There are several reasons why this can happen. For example, the destination character set has no representation for a corresponding character, or the input sequence ends up with an illegal shift state. The C++
standard does not define any method that can be used to determine the cause of the error more precisely.
The function out() is equivalent to the function in(), except that it converts in the opposite direction. That is, it converts an internal representation to an external representation. The meanings of the arguments and the values returned are the same; only the types of the arguments are swapped. That is, fb and fe now have the type const internT*, and tb and te now have the type externT*. The same applies to fn and tn, which now have the types const internT*& and externT*&, respectively.
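Although direct use of codecvt is rare, a call of in() can be sketched as follows. The function toInternal() is a hypothetical helper. The example assumes that the whole input fits into one buffer and that the codecvt<wchar_t,char,mbstate_t> facet of the classic locale maps plain ASCII bytes to the corresponding wide characters, which is common but not guaranteed by the standard:

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Converts an external byte sequence to wide characters in one step.
// Throws if the conversion reports anything other than complete success.
std::wstring toInternal(const std::string& bytes,
                        const std::locale& loc = std::locale::classic()) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Converter;
    const Converter& cvt = std::use_facet<Converter>(loc);

    // each wide character consumes at least one char,
    // so bytes.size() wide characters are always enough
    std::wstring result(bytes.size(), L'\0');
    if (bytes.empty()) {
        return result;
    }

    std::mbstate_t state = std::mbstate_t();   // initial shift state
    const char* fb = bytes.data();
    const char* fe = fb + bytes.size();
    const char* fn = fb;
    wchar_t* tb = &result[0];
    wchar_t* te = tb + result.size();
    wchar_t* tn = tb;

    std::codecvt_base::result r = cvt.in(state, fb, fe, fn, tb, te, tn);
    if (r != std::codecvt_base::ok || fn != fe) {
        throw std::runtime_error("conversion failed or was incomplete");
    }
    result.resize(tn - tb);   // keep only the characters actually produced
    return result;
}
```

A conversion that processes input in several chunks would instead keep state between the calls and continue at fn whenever partial is returned.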
The function unshift() inserts characters necessary to complete a sequence when the current state of the conversion is passed as the argument s. This normally means that a shift state is switched to the initial shift state. Only the external representation is terminated. Thus, the arguments tb and te are of type externT*, and tn is of type externT*&. The sequence between tb and te defines the output buffer in which the characters are stored. The end of the result sequence is stored in tn. unshift() returns a value as shown in Table 14.22.
Table 14.22. Return Values of the Function unshift()
Value | Meaning |
---|---|
ok | The sequence was completed successfully |
partial | More characters need to be stored to complete the sequence |
error | The state is invalid |
noconv | No character was needed to complete the sequence |
The function encoding() returns some information about the encoding of the external representation. If encoding() returns −1, the conversion is state dependent. If encoding() returns 0, the number of externTs needed to produce an internal character is not constant. Otherwise, the number of externTs needed to produce an internT is returned. This information can be used to provide appropriate buffer sizes.
The function always_noconv()
returns true
if the functions in()
and out()
never perform a conversion. For example, the standard implementation of codecvt<char, char, mbstate_t>
does no conversion, and thus, always_noconv()
returns true
for this facet. However, this only holds for the codecvt
facet from the "C"
locale. Other instances of this facet may actually do a conversion.
The function length()
returns the number of externT
s from the sequence between fb
and fe
necessary to produce max
characters of type internT.
If there are fewer than max
complete internT
characters in the sequence between fb
and fe,
the number of externT
s used to produce a maximum number of internTs
from the sequence is returned.
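The informational members can be queried as in the following sketch, which inspects the degenerate facet codecvt<char,char,mbstate_t> of the classic locale. The structure ConverterInfo and the function describeCharConverter() are names invented for this example:

```cpp
#include <cwchar>
#include <locale>

// Collects the properties of a converter reported by its members.
struct ConverterInfo {
    bool alwaysNoconv;                  // result of always_noconv()
    int encoding;                       // result of encoding()
    int maxLength;                      // result of max_length()
    std::codecvt_base::result inResult; // result of a sample call of in()
};

ConverterInfo describeCharConverter() {
    typedef std::codecvt<char, char, std::mbstate_t> Converter;
    const Converter& cvt = std::use_facet<Converter>(std::locale::classic());

    ConverterInfo info;
    info.alwaysNoconv = cvt.always_noconv();
    info.encoding = cvt.encoding();
    info.maxLength = cvt.max_length();

    // sample conversion of three characters
    const char source[] = "abc";
    char dest[3];
    std::mbstate_t state = std::mbstate_t();
    const char* fn = source;
    char* tn = dest;
    info.inResult = cvt.in(state, source, source + 3, fn, dest, dest + 3, tn);
    return info;
}
```

For this facet no conversion takes place, so in() reports noconv and nothing is written to the destination buffer.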
The facet collate
handles differences between conventions for the sorting of strings. For example, in German the letter "ü" is treated as being equivalent to the letter "u" or to the letters "ue" for the purpose of sorting strings. For other languages, this letter is not even a letter, and it is treated as a special character, when it is treated at all. Other languages use slightly different sorting rules for certain character sequences. The collate
facet can be used to provide a sorting of strings that is familiar to the user. Table 14.23 lists the member functions of this facet. In this table, col
is an instantiation of collate,
and the arguments passed to the functions are iterators that are used to define strings.
Table 14.23. Members of the collate<> Facet
Expression | Meaning |
---|---|
col.compare(beg1,end1,beg2,end2) | Returns 1 if the first string is greater than the second, 0 if both strings are equal, and −1 if the first string is smaller than the second |
col.transform(beg,end) | Returns a string to be compared with other transformed strings |
col.hash(beg,end) | Returns a hash value (of type long) for the string |
The collate
facet is a class template that takes a character type charT
as its template argument. The strings passed to collate's
members are specified using iterators of type const charT*.
This is somewhat unfortunate because there is no guarantee that the iterators used by the type basic_string<charT>
are also pointers. Thus, strings have to be compared using something like this:
    std::locale loc;
    std::string s1, s2;
    ...
    // get collate facet of the loc locale
    const std::collate<char>& col = std::use_facet<std::collate<char> >(loc);
    // compare strings by using the collate facet
    int result = col.compare(s1.data(), s1.data()+s1.size(),
                             s2.data(), s2.data()+s2.size());
    if (result == 0) {
        // s1 and s2 are equal
        ...
    }
The reason for this limitation is that you cannot predict which iterator types are necessary. It would be necessary to have collation facets for the pointer type and for an infinite amount of iterator types.
Of course, here the special convenience function of locale
can be used to compare strings (see page 703):
    bool result = loc(s1,s2);    // true if s1 is less than s2
But this works only for the compare()
member function. There are no convenient functions defined by the C++
standard library for the other two members of collate.
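Both ways of comparing can be sketched as follows. The function names collateCompare() and sortedByLocale() are made up for this example. Note that the function call operator of class locale yields a bool (whether the first string is less than the second), which is why a locale object can be passed directly as a sorting criterion:

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// Three-way comparison of two strings with the collate facet of loc.
int collateCompare(const std::string& a, const std::string& b,
                   const std::locale& loc) {
    const std::collate<char>& col = std::use_facet<std::collate<char> >(loc);
    return col.compare(a.data(), a.data() + a.size(),
                       b.data(), b.data() + b.size());
}

// Sorts strings according to the collation rules of loc.
std::vector<std::string> sortedByLocale(std::vector<std::string> v,
                                        const std::locale& loc) {
    std::sort(v.begin(), v.end(), loc);   // loc(s1,s2) acts as "less than"
    return v;
}
```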
The transform() function returns an object of type basic_string<charT>. The lexicographical order of strings returned from transform() is the same as the order of the original strings using compare(). This ordering can be used for better performance if one string has to be compared with many other strings. Determining the lexicographical order of the transformed strings can be much faster than using compare(). This is because the national sorting rules can be relatively complex.
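As a sketch of this technique, the following function computes the transformed key of each string once and then sorts by plain lexicographical comparison of the keys. The name sortByTransformedKey() is hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <locale>
#include <string>
#include <utility>
#include <vector>

// Sorts words according to the collation rules of loc,
// calling transform() only once per word.
std::vector<std::string>
sortByTransformedKey(const std::vector<std::string>& words,
                     const std::locale& loc) {
    const std::collate<char>& col = std::use_facet<std::collate<char> >(loc);

    // pair each word with its transformed key
    std::vector<std::pair<std::string, std::string> > keyed;
    for (std::size_t i = 0; i < words.size(); ++i) {
        const std::string& w = words[i];
        keyed.push_back(std::make_pair(
            col.transform(w.data(), w.data() + w.size()), w));
    }

    // pairs compare by key first, using ordinary string comparison
    std::sort(keyed.begin(), keyed.end());

    std::vector<std::string> result;
    for (std::size_t i = 0; i < keyed.size(); ++i) {
        result.push_back(keyed[i].second);
    }
    return result;
}
```

Whether this is faster than sorting with compare() depends on the locale and on the number of comparisons, so it should be measured before it is relied on.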
The C++ standard library mandates support only for the two instantiations collate<char>
and collate<wchar_t>.
For other character types, users must write their own specializations, potentially using the standard instantiations.
The messages
facet is used to retrieve internationalized messages from a catalog of messages. This facet is intended primarily to provide a service similar to that of the function perror().
This function is used in POSIX systems to print a system error message for an error number stored in the global variable errno.
Of course, the service provided by messages
is more flexible. Unfortunately, it is not defined very precisely.
The messages
facet is a template class that takes a character type charT
as its template argument. The strings returned from this facet are of type basic_string<charT>.
The basic use of this facet consists of opening a catalog, retrieving messages, and then closing the catalog. The class messages
derives from a class messages_base,
which defines a type catalog
(actually, it is a type definition for int). An object of this type is used to identify the catalog on which the members of messages
operate. Table 14.24 lists the member functions of the messages
facet.
The name passed as the argument to the open()
function identifies the catalog in which the message strings are stored. This can be, for example, the name of a file. The loc
argument identifies a locale
object that is used to access a ctype
facet. This facet is used to convert the message to the desired character type.
The exact semantics of the get() member are not defined. An implementation for POSIX systems could, for example, return the string corresponding to the error message for error msgid, but this behavior is not required by the standard. The set argument is intended to create a substructure within the messages. For example, it might be used to distinguish between system errors and errors of the C++ standard library.
Table 14.24. Members of the messages<> Facet
Expression | Meaning |
---|---|
msg.open(name,loc) | Opens a catalog and returns a corresponding ID |
msg.get(cat,set,msgid,def) | Returns the message with ID msgid from catalog cat; if there is no such message, def is returned instead |
msg.close(cat) | Closes the catalog |
When a message catalog is no longer needed, it can be released using the close()
function. Although the interface using open()
and close()
suggests that the messages are retrieved from a file as needed, this is by no means required. Actually, it is more likely that open()
reads a file and stores the messages in memory. A later call to close()
would then release this memory.
The standard requires that the two instantiations messages<char>
and messages<wchar_t>
be stored in each locale. The C++ standard library does not support any other instantiations.
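The sequence of opening a catalog, retrieving a message, and closing the catalog can be sketched as follows. The function lookupMessage() is a hypothetical helper. How a catalog name is mapped to actual data, and which set and message IDs exist, depends on the implementation, so the sketch always passes a default text that is used when nothing is found:

```cpp
#include <locale>
#include <string>

// Looks up a message in the catalog named catalogName.
// Returns fallback if the catalog or the message is not available.
std::string lookupMessage(const std::string& catalogName,
                          int set, int msgid,
                          const std::string& fallback,
                          const std::locale& loc = std::locale::classic()) {
    const std::messages<char>& msg = std::use_facet<std::messages<char> >(loc);

    // a negative ID signals that the catalog could not be opened
    std::messages_base::catalog cat = msg.open(catalogName, loc);
    if (cat < 0) {
        return fallback;
    }
    std::string text = msg.get(cat, set, msgid, fallback);
    msg.close(cat);
    return text;
}
```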
[1] i18n is a common abbreviation for internationalization. It stands for the letter i, followed by 18 characters, followed by the letter n.
[2] Note that you have to put a space between the two ">"
characters. ">>"
would be parsed as shift operator, which would result in a syntax error.
[3] POSIX and X/Open are standards for operating system interfaces.
[4] See http://www.josuttis.com/libbook/examples.html
for a complete example program.
[4] This locale is only identical to the global C++ locale if the last call to locale::global() was with a named locale and if there was no call to setlocale()
since then. Otherwise, the locale used by the C functions is different from the global C++ locale.