Early computing systems generally used English for their output (prompts, error messages) and input (responses to queries, such as “yes” and “no”). This was true of Unix systems, even into the mid-1980s. In the late 1980s, beginning with the first ISO standard for C and continuing with the POSIX standards of the 1990s and the current POSIX standard, facilities were developed to make it possible for programs to work in multiple languages, without a requirement to maintain multiple versions of the same program. This chapter describes how modern programs should deal with multiple-language issues.
The central concept is the locale, the place in which a program is run. Locales encapsulate information about the following: the local character set; how to display date and time information; how to format and display monetary amounts; and how to format and display numeric values (with or without a thousands separator, what character to use as the decimal point, and so on).
Internationalization is the process of writing (or modifying) a program so that it can function in multiple locales. Localization is the process of tailoring an internationalized program for a specific locale. These terms are often abbreviated i18n and l10n, respectively. (The numeric values indicate how many characters appear in the middle of the word, and these abbreviations bear a minor visual resemblance to the full terms. They’re also considerably easier to type.) Another term that appears frequently is native language support, abbreviated NLS; NLS refers to the programmatic support for doing i18n and l10n.
Additionally, some people use the term globalization (abbreviated g10n) to mean the process of preparing all possible localizations for an internationalized program; in other words, making it ready for global use.
NLS facilities exist at two levels. The first level is the C library. It provides information about the locale; routines to handle much of the low-level detail work for formatting date/time, numeric and monetary values; and routines for locale-correct regular expression matching and character classification and comparison. It is the library facilities that appear in the C and POSIX standards.
At the application level, GNU gettext provides commands and a library for localizing a program: that is, making all output messages available in one or more natural languages. GNU gettext is based on a design originally done by Sun Microsystems for Solaris;[1] however, it was implemented from scratch and now provides extensions to the original Solaris gettext. GNU gettext is a de facto standard for program localization, particularly in the GNU world.
In addition to locales and gettext, Standard C provides facilities for working with multiple character sets and their encodings (ways to represent large character sets with fewer bytes). We touch on these issues, briefly, at the end of the chapter.
You control locale-specific behavior by setting environment variables to describe which locale(s) to use for particular kinds of information. The number of available locales offered by any particular operating system ranges from fewer than ten on some commercial Unix systems to hundreds of locales on GNU/Linux systems. (’locale -a’ prints the full list of available locales.)
Two locales, “C” and “POSIX”, are guaranteed to exist. They act as the default locale, providing a 7-bit ASCII environment whose behavior is the same as traditional, non-locale-aware Unix systems. Otherwise, locales specify a language, country, and, optionally, character set information. For example, “it_IT” is for Italian in Italy using the system’s default character set, and “it_IT.UTF-8” uses the UTF-8 character encoding for the Unicode character set.
More details on locale names can be found in the GNU/Linux setlocale(3) manpage. Typically, GNU/Linux distributions set the default locale for a system when it’s installed, based on the language chosen by the installer, and users don’t need to worry about it anymore.
The <locale.h> header file defines the locale functions and structures. Locale categories define the kinds of information about which a program will be locale-aware. The categories are available as a set of symbolic constants. They are listed in Table 13.1.
Table 13.1. ISO C locale category constants defined in <locale.h>
These categories are the ones defined by the various standards. Some systems may support additional categories, such as LC_TELEPHONE or LC_ADDRESS. However, these are not standardized; any program that needs to use them but that still needs to be portable should use #ifdef to enclose the relevant sections.
By default, C programs and the C library act as if they are in the “C” or “POSIX” locale, to provide compatibility with historical systems and behavior. However, by calling setlocale() (as described below), a program can enable locale awareness. Once a program does this, the user can, by setting environment variables, enable and disable the degree of locale functionality that the program will have.
The environment variables have the same names as the locale categories listed in Table 13.1. Thus, the command
export LC_NUMERIC=en_DK LC_TIME=C
specifies that numbers should be printed according to the “en_DK” (English in Denmark) locale, but that date and time values should be printed according to the regular “C” locale. (This example merely illustrates that you can specify different locales for different categories; it’s not necessarily something that you should do.)
The environment variable LC_ALL overrides all other LC_xxx variables. If LC_ALL isn’t set, then the library looks for the specific variables (LC_CTYPE, LC_MONETARY, and so on). Finally, if none of those is set, the library looks for the variable LANG. Here is a small demonstration, using gawk:
$ unset LC_ALL LANG                                   Remove default variables
$ export LC_NUMERIC=en_DK LC_TIME=C                   European numbers, default date, time
$ gawk 'BEGIN { print 1.234 ; print strftime() }'     Print a number, current date, time
1,234
Wed Jul 09 09:32:18 PDT 2003
$ export LC_NUMERIC=it_IT LC_TIME=it_IT               Italian numbers, date, time
$ gawk 'BEGIN { print 1.234 ; print strftime() }'     Print a number, current date, time
1,234
mer lug 09 09:32:40 PDT 2003
$ export LC_ALL=C                                     Set overriding variable
$ gawk 'BEGIN { print 1.234 ; print strftime() }'     Print a number, current date, time
1.234
Wed Jul 09 09:33:00 PDT 2003
(For awk, the POSIX standard states that numeric constants in the source code always use ’.’ as the decimal point, whereas numeric output follows the rules of the locale.)
Almost all GNU versions of the standard Unix utilities are locale-aware. Thus, particularly on GNU/Linux systems, setting these variables gives you control over the system’s behavior.[2]
As mentioned, if you do nothing, C programs and the C library act as if they’re in the “C” locale. The setlocale() function enables locale awareness:
#include <locale.h> ISO C
char *setlocale(int category, const char *locale);
The category argument is one of the locale categories described in Section 13.2.1, “Locale Categories and Environment Variables,” page 487. The locale argument is a string naming the locale to use for that category. When locale is the empty string (""), setlocale() inspects the appropriate environment variables.
If locale is NULL, the locale information is not changed. Instead, the function returns a string representing the current locale for the given category.
Because each category can be set individually, the application’s author decides how locale-aware the program will be. For example, if main() only does this:
setlocale(LC_TIME, ""); /* Be locale-aware for time, but that's it. */
then, no matter what other LC_xxx variables are set in the environment, only the time and date functions obey the locale. All others act as if the program is still in the "C" locale. Similarly, the call:
setlocale(LC_TIME, "it_IT"); /* For the time, we're always in Italy. */
overrides the LC_TIME environment variable (as well as LC_ALL), forcing the program to be Italian for time/date computations. (Although Italy may be a great place to be, programs are better off using "" so that they work correctly everywhere; this example is here just to explain how setlocale() works.)
You can call setlocale() individually for each category, but the simplest thing to do is set everything in one fell swoop:
/* When in Rome, do as the Romans do, for *everything*. :-) */
setlocale(LC_ALL, "");
setlocale()’s return value is the current setting of the locale. This is either a string value passed in from an earlier call or an opaque value representing the locale in use at startup. This same value can then later be passed back to setlocale(). For later use, the return value should be copied into local storage since it is a pointer to internal data:
char *initial_locale;

initial_locale = strdup(setlocale(LC_ALL, ""));    /* save copy */
...
(void) setlocale(LC_ALL, initial_locale);          /* restore it */
Here, we’ve saved a copy by using the POSIX strdup() function (see Section 3.2.2, “String Copying: strdup(),” page 74).
The familiar strcmp() function compares two strings, returning negative, zero, or positive values if the first string is less than, equal to, or greater than the second one. This comparison is based on the numeric values of characters in the machine’s character set. Because of this, strcmp()’s result never varies.
However, in a locale-aware world, simple numeric comparison isn’t enough. Each locale defines the collating sequence for characters within it, in other words, the relative order of characters within the locale. For example, in simple 7-bit ASCII, the two characters A and a have the decimal numeric values 65 and 97, respectively. Thus, in the fragment
int i = strcmp("A", "a");
i has a negative value. However, in the "en_US.UTF-8" locale, A comes after a, not before it. Thus, using strcmp() for applications that need to be locale-aware is a bad idea; we might say it returns a locale-ignorant answer.
The strcoll() (string collate) function exists to compare strings in a locale-aware fashion:
#include <string.h> ISO C
int strcoll(const char *s1, const char *s2);
Its return value is the same negative/zero/positive as strcmp(). The following program, ch13-compare.c, interactively demonstrates the difference:
 1  /* ch13-compare.c --- demonstrate strcmp() vs. strcoll() */
 2
 3  #include <stdio.h>
 4  #include <locale.h>
 5  #include <string.h>
 6
 7  int main(void)
 8  {
 9  #define STRBUFSIZE 1024
10      char locale[STRBUFSIZE], curloc[STRBUFSIZE];
11      char left[STRBUFSIZE], right[STRBUFSIZE];
12      char buf[BUFSIZ];
13      int count;
14
15      setlocale(LC_ALL, "");                      /* set to env locale */
16      strcpy(curloc, setlocale(LC_ALL, NULL));    /* save it */
17
18      printf("--> "); fflush(stdout);
19      while (fgets(buf, sizeof buf, stdin) != NULL) {
20          locale[0] = '\0';
21          count = sscanf(buf, "%s %s %s", left, right, locale);
22          if (count < 2)
23              break;
24
25          if (*locale) {
26              setlocale(LC_ALL, locale);
27              strcpy(curloc, locale);
28          }
29
30          printf("%s: strcmp(\"%s\", \"%s\") is %d\n", curloc, left,
31              right, strcmp(left, right));
32          printf("%s: strcoll(\"%s\", \"%s\") is %d\n", curloc, left,
33              right, strcoll(left, right));
34
35          printf("\n--> "); fflush(stdout);
36      }
37
38      return 0;
39  }
The program reads input lines, which consist of two words to compare and, optionally, a locale to use for the comparison. If the locale is given, that becomes the locale for subsequent entries. It starts out with whatever locale is set in the environment.
The curloc array saves the current locale for printing results; left and right are the left- and right-hand words to compare (lines 10–11). The main part of the program is a loop (lines 19–36) that reads lines and does the work. Lines 20–23 split up the input line. locale is initialized to the empty string, in case a third value isn’t provided.
Lines 25–28 set the new locale if there is one. Lines 30–33 print the comparison results, and line 35 prompts for more input. Here’s a demonstration:
$ ch13-compare                                    Run the program
--> ABC abc                                       Enter two words
C: strcmp("ABC", "abc") is -1                     Program started in "C" locale
C: strcoll("ABC", "abc") is -1                    Identical results in "C" locale

--> ABC abc en_US                                 Same words, "en_US" locale
en_US: strcmp("ABC", "abc") is -1                 strcmp() results don't change
en_US: strcoll("ABC", "abc") is 2                 strcoll() results do!

--> ABC abc en_US.UTF-8                           Same words, "en_US.UTF-8" locale
en_US.UTF-8: strcmp("ABC", "abc") is -1
en_US.UTF-8: strcoll("ABC", "abc") is 6           Different value, still positive

--> junk JUNK                                     New words
en_US.UTF-8: strcmp("junk", "JUNK") is 1          Previous locale used
en_US.UTF-8: strcoll("junk", "JUNK") is -6
This program clearly demonstrates the difference between strcmp() and strcoll(). Since strcmp() works in accordance with the numeric character values, it always returns the same result. strcoll() understands collation issues, and its result varies according to the locale. We see that in both en_US locales, the uppercase letters come after the lowercase ones.
Locale-specific string collation is also an issue in regular-expression matching. Regular expressions allow character ranges within bracket expressions, such as ’[a-z]’ or ’["-/]’. The exact meaning of such a construct (the characters numerically between the start and end points, inclusive) is defined only for the "C" and "POSIX" locales.
For non-ASCII locales, a range such as ’[a-z]’ can also match uppercase letters, not just lowercase ones! The range ’["-/]’ is valid in ASCII, but not in "en_US.UTF-8".
The most portable long-term solution is to use POSIX character classes, such as ’[[:lower:]]’ and ’[[:punct:]]’. If you find yourself needing to use range expressions on systems that are locale-aware and on older systems that are not, but without having to change your program, the solution is to use brute force and list each character individually within the brackets. It isn’t pretty, but it works.
Locale-based collation is potentially expensive. If you expect to be doing lots of comparisons, where at least one of the strings will not change or where string values will be compared against each other multiple times (such as in sorting a list), then you should consider using the strxfrm() function to convert your strings to versions that can be used with strcmp(). The strxfrm() function is declared as follows:
#include <string.h> ISO C
size_t strxfrm(char *dest, const char *src, size_t n);
The idea is that strxfrm() transforms src, placing no more than n characters of the result (including the terminating zero byte) into dest. The return value is the number of characters necessary to hold the transformed string, not counting the terminating zero byte. If this is n or more, then the contents of dest are “indeterminate.”
The POSIX standard explicitly allows n to be zero and dest to be NULL. In this case, strxfrm() returns the size of the array needed to hold the transformed version of src
(not including the final ’