Chapter 13. Internationalization and Localization

In this chapter

  • 13.1 Introduction

  • 13.2 Locales and the C Library

  • 13.3 Dynamic Translation of Program Messages

  • 13.4 Can You Spell That for Me, Please?

  • 13.5 Suggested Reading

  • 13.6 Summary

  • Exercises

Early computing systems generally used English for their output (prompts, error messages) and input (responses to queries, such as “yes” and “no”). This was true of Unix systems, even into the mid-1980s. In the late 1980s, beginning with the first ISO standard for C and continuing with the POSIX standards of the 1990s and the current POSIX standard, facilities were developed to make it possible for programs to work in multiple languages, without a requirement to maintain multiple versions of the same program. This chapter describes how modern programs should deal with multiple-language issues.

Introduction

The central concept is the locale, the place in which a program is run. Locales encapsulate information about the following: the local character set; how to display date and time information; how to format and display monetary amounts; and how to format and display numeric values (with or without a thousands separator, what character to use as the decimal point, and so on).

Internationalization is the process of writing (or modifying) a program so that it can function in multiple locales. Localization is the process of tailoring an internationalized program for a specific locale. These terms are often abbreviated i18n and l10n, respectively. (The numeric values indicate how many characters appear in the middle of the word, and these abbreviations bear a minor visual resemblance to the full terms. They’re also considerably easier to type.) Another term that appears frequently is native language support, abbreviated NLS; NLS refers to the programmatic support for doing i18n and l10n.

Additionally, some people use the term globalization (abbreviated g11n) to mean the process of preparing all possible localizations for an internationalized program, in other words, making it ready for global use.

NLS facilities exist at two levels. The first level is the C library. It provides information about the locale; routines to handle much of the low-level detail work for formatting date/time, numeric, and monetary values; and routines for locale-correct regular expression matching and character classification and comparison. These library facilities are the ones that appear in the C and POSIX standards.

At the application level, GNU gettext provides commands and a library for localizing a program: that is, making all output messages available in one or more natural languages. GNU gettext is based on a design originally done by Sun Microsystems for Solaris;[1] however, it was implemented from scratch and now provides extensions to the original Solaris gettext. GNU gettext is a de facto standard for program localization, particularly in the GNU world.

In addition to locales and gettext, Standard C provides facilities for working with multiple character sets and their encodings—ways to represent large character sets with fewer bytes. We touch on these issues, briefly, at the end of the chapter.

Locales and the C Library

You control locale-specific behavior by setting environment variables to describe which locale(s) to use for particular kinds of information. The number of available locales offered by any particular operating system ranges from fewer than ten on some commercial Unix systems to hundreds of locales on GNU/Linux systems. (’locale -a’ prints the full list of available locales.)

Two locales, “C” and “POSIX”, are guaranteed to exist. They act as the default locale, providing a 7-bit ASCII environment whose behavior is the same as traditional, non-locale-aware Unix systems. Otherwise, locales specify a language, country, and, optionally, character set information. For example, “it_IT” is for Italian in Italy using the system’s default character set, and “it_IT.UTF-8” uses the UTF-8 character encoding for the Unicode character set.

More details on locale names can be found in the GNU/Linux setlocale(3) manpage. Typically, GNU/Linux distributions set the default locale for a system when it’s installed, based on the language chosen by the installer, and users don’t need to worry about it anymore.

Locale Categories and Environment Variables

The <locale.h> header file defines the locale functions and structures. Locale categories define the kinds of information about which a program will be locale-aware. The categories are available as a set of symbolic constants. They are listed in Table 13.1.

Table 13.1. ISO C locale category constants defined in <locale.h>

Category

Meaning

LC_ALL

This category includes all possible locale information. This consists of the rest of the items in this table.

LC_COLLATE

The category for string collation (discussed below) and regular expression ranges.

LC_CTYPE

The category for classifying characters (upper case, lower case, etc.). This affects regular expression matching and the isXXX() functions in <ctype.h>.

LC_MESSAGES

The category for locale-specific messages. This category comes into play with GNU gettext, discussed later in the chapter.

LC_MONETARY

The category for formatting monetary information, such as the local and international symbols for the local currency (for example, $ vs. USD for U.S. dollars), how to format negative values, and so on.

LC_NUMERIC

The category for formatting numeric values.

LC_TIME

The category for formatting dates and times.

These categories are the ones defined by the various standards. Some systems may support additional categories, such as LC_TELEPHONE or LC_ADDRESS. However, these are not standardized; any program that needs to use them but that still needs to be portable should use #ifdef to enclose the relevant sections.

By default, C programs and the C library act as if they are in the “C” or “POSIX” locale, to provide compatibility with historical systems and behavior. However, by calling setlocale() (as described below), a program can enable locale awareness. Once a program does this, the user can, by setting environment variables, enable and disable the degree of locale functionality that the program will have.

The environment variables have the same names as the locale categories listed in Table 13.1. Thus, the command—

export LC_NUMERIC=en_DK LC_TIME=C

—specifies that numbers should be printed according to the “en_DK” (English in Denmark) locale, but that date and time values should be printed according to the regular “C” locale. (This example merely illustrates that you can specify different locales for different categories; it’s not necessarily something that you should do.)

The environment variable LC_ALL overrides all other LC_xxx variables. If LC_ALL isn’t set, then the library looks for the specific variables (LC_CTYPE, LC_MONETARY, and so on). Finally, if none of those is set, the library looks for the variable LANG. Here is a small demonstration, using gawk:

$ unset LC_ALL LANG                                  Remove default variables
$ export LC_NUMERIC=en_DK LC_TIME=C                  European numbers, default date, time
$ gawk 'BEGIN { print 1.234 ; print strftime() }'    Print a number, current date, time
1,234
Wed Jul 09 09:32:18 PDT 2003
$ export LC_NUMERIC=it_IT LC_TIME=it_IT              Italian numbers, date, time
$ gawk 'BEGIN { print 1.234 ; print strftime() }'    Print a number, current date, time
1,234
mer lug 09 09:32:40 PDT 2003
$ export LC_ALL=C                                    Set overriding variable
$ gawk 'BEGIN { print 1.234 ; print strftime() }'    Print a number, current date, time
1.234
Wed Jul 09 09:33:00 PDT 2003

(For awk, the POSIX standard states that numeric constants in the source code always use ’.’ as the decimal point, whereas numeric output follows the rules of the locale.)

Almost all GNU versions of the standard Unix utilities are locale-aware. Thus, particularly on GNU/Linux systems, setting these variables gives you control over the system’s behavior.[2]

Setting the Locale: setlocale()

As mentioned, if you do nothing, C programs and the C library act as if they’re in the “C” locale. The setlocale() function enables locale awareness:

#include <locale.h>                                           ISO C

char *setlocale(int category, const char *locale);

The category argument is one of the locale categories described in Section 13.2.1, “Locale Categories and Environment Variables,” page 487. The locale argument is a string naming the locale to use for that category. When locale is the empty string (“”), setlocale() inspects the appropriate environment variables.

If locale is NULL, the locale information is not changed. Instead, the function returns a string representing the current locale for the given category.

Because each category can be set individually, the application’s author decides how locale-aware the program will be. For example, if main() only does this—

setlocale(LC_TIME, "");           /* Be locale-aware for time, but that's it. */

—then, no matter what other LC_xxx variables are set in the environment, only the time and date functions obey the locale. All others act as if the program is still in the "C" locale. Similarly, the call:

setlocale(LC_TIME, "it_IT");     /* For the time, we're always in Italy. */

overrides the LC_TIME environment variable (as well as LC_ALL), forcing the program to be Italian for time/date computations. (Although Italy may be a great place to be, programs are better off using "" so that they work correctly everywhere; this example is here just to explain how setlocale() works.)

You can call setlocale() individually for each category, but the simplest thing to do is set everything in one fell swoop:

/* When in Rome, do as the Romans do, for *everything*. :-) */
setlocale(LC_ALL, "");

setlocale()’s return value is the current setting of the locale. This is either a string value passed in from an earlier call or an opaque value representing the locale in use at startup. This same value can then later be passed back to setlocale(). For later use, the return value should be copied into local storage since it is a pointer to internal data:

char *initial_locale;

initial_locale = strdup(setlocale(LC_ALL, ""));   /* save copy */
...
(void) setlocale(LC_ALL, initial_locale);         /* restore it */

Here, we’ve saved a copy by using the POSIX strdup() function (see Section 3.2.2, “String Copying: strdup(),” page 74).

String Collation: strcoll() and strxfrm()

The familiar strcmp() function compares two strings, returning negative, zero, or positive values if the first string is less than, equal to, or greater than the second one. This comparison is based on the numeric values of characters in the machine’s character set. Because of this, strcmp()’s result never varies.

However, in a locale-aware world, simple numeric comparison isn’t enough. Each locale defines the collating sequence for characters within it, in other words, the relative order of characters within the locale. For example, in simple 7-bit ASCII, the two characters A and a have the decimal numeric values 65 and 97, respectively. Thus, in the fragment

int i = strcmp("A", "a");

i has a negative value. However, in the "en_US.UTF-8" locale, A comes after a, not before it. Thus, using strcmp() for applications that need to be locale-aware is a bad idea; we might say it returns a locale-ignorant answer.

The strcoll() (string collate) function exists to compare strings in a locale-aware fashion:

#include <string.h>                                            ISO C

int strcoll(const char *s1, const char *s2);

Its return value is the same negative/zero/positive as strcmp(). The following program, ch13-compare.c, interactively demonstrates the difference:

 1   /* ch13-compare.c --- demonstrate strcmp() vs. strcoll() */
 2
 3   #include <stdio.h>
 4   #include <locale.h>
 5   #include <string.h>
 6
 7   int main(void)
 8   {
 9   #define STRBUFSIZE  1024
10       char locale[STRBUFSIZE], curloc[STRBUFSIZE];
11       char left[STRBUFSIZE], right[STRBUFSIZE];
12       char buf[BUFSIZ];
13       int count;
14
15       setlocale(LC_ALL, "");              /* set to env locale */
16       strcpy(curloc, setlocale(LC_ALL, NULL));    /* save it */
17
18       printf("--> "); fflush(stdout);
19       while (fgets(buf, sizeof buf, stdin) != NULL) {
20           locale[0] = '\0';
21           count = sscanf(buf, "%s %s %s", left, right, locale);
22           if (count < 2)
23               break;
24
25           if (*locale) {
26               setlocale(LC_ALL, locale);
27               strcpy(curloc, locale);
28           }
29
30           printf("%s: strcmp(\"%s\", \"%s\") is %d\n", curloc, left,
31                   right, strcmp(left, right));
32           printf("%s: strcoll(\"%s\", \"%s\") is %d\n", curloc, left,
33                   right, strcoll(left, right));
34
35           printf("\n--> "); fflush(stdout);
36       }
37
38       return 0;
39   }

The program reads input lines, which consist of two words to compare and, optionally, a locale to use for the comparison. If the locale is given, that becomes the locale for subsequent entries. It starts out with whatever locale is set in the environment.

The curloc array saves the current locale for printing results; left and right are the left- and right-hand words to compare (lines 10–11). The main part of the program is a loop (lines 19–36) that reads lines and does the work. Lines 20–23 split up the input line. locale is initialized to the empty string, in case a third value isn’t provided.

Lines 25–28 set the new locale if there is one. Lines 30–33 print the comparison results, and line 35 prompts for more input. Here’s a demonstration:

$ ch13-compare                                 Run the program
--> ABC abc                                    Enter two words
C: strcmp("ABC", "abc") is -1                  Program started in "C" locale
C: strcoll("ABC", "abc") is -1                 Identical results in "C" locale

--> ABC abc en_US                              Same words, "en_US" locale
en_US: strcmp("ABC", "abc") is -1              strcmp() results don't change
en_US: strcoll("ABC", "abc") is 2              strcoll() results do!

--> ABC abc en_US.UTF-8                        Same words, "en_US.UTF-8" locale
en_US.UTF-8: strcmp("ABC", "abc") is -1
en_US.UTF-8: strcoll("ABC", "abc") is 6        Different value, still positive

--> junk JUNK                                  New words
en_US.UTF-8: strcmp("junk", "JUNK") is 1       Previous locale used
en_US.UTF-8: strcoll("junk", "JUNK") is -6

This program clearly demonstrates the difference between strcmp() and strcoll(). Since strcmp() works in accordance with the numeric character values, it always returns the same result. strcoll() understands collation issues, and its result varies according to the locale. We see that in both en_US locales, the uppercase letters come after the lowercase ones.

Note

Locale-specific string collation is also an issue in regular-expression matching. Regular expressions allow character ranges within bracket expressions, such as ’[a-z]’ or ’["-/]’. The exact meaning of such a construct (the characters numerically between the start and end points, inclusive) is defined only for the "C" and "POSIX" locales.

For non-ASCII locales, a range such as ’[a-z]’ can also match uppercase letters, not just lowercase ones! The range ’["-/]’ is valid in ASCII, but not in "en_US.UTF-8".

The long-term most portable solution is to use POSIX character classes, such as ’[[:lower:]]’ and ’[[:punct:]]’. If you find yourself needing to use range expressions on systems that are locale-aware and on older systems that are not, but without having to change your program, the solution is to use brute force and list each character individually within the brackets. It isn’t pretty, but it works.

Locale-based collation is potentially expensive. If you expect to be doing lots of comparisons, where at least one of the strings will not change or where string values will be compared against each other multiple times (such as in sorting a list), then you should consider using the strxfrm() function to convert your strings to versions that can be used with strcmp(). The strxfrm() function is declared as follows:

#include <string.h>                                                 ISO C

size_t strxfrm(char *dest, const char *src, size_t n);

The idea is that strxfrm() transforms src, placing no more than n characters of the result (including the terminating zero byte) into dest. The return value is the number of characters necessary to hold the transformed string, not counting the terminating zero byte. If this is n or more, then the contents of dest are “indeterminate.”

The POSIX standard explicitly allows n to be zero and dest to be NULL. In this case, strxfrm() returns the size of the array needed to hold the transformed version of src (not including the final ’\0’ character). Presumably, this value would then be used with malloc() for creating the dest array or for checking the size against a predefined array bound. (When doing this, obviously, src must have a terminating zero byte.) This fragment illustrates how to use strxfrm():

#define STRBUFSIZE ...
char s1[STRBUFSIZE], s2[STRBUFSIZE];                 Original strings
char s1x[STRBUFSIZE], s2x[STRBUFSIZE];               Transformed copies
int cmp;

... fill in s1 and s2...

/* Pass the size of the destination; a result >= that size did not fit. */
if (strxfrm(s1x, s1, STRBUFSIZE) >= STRBUFSIZE
    || strxfrm(s2x, s2, STRBUFSIZE) >= STRBUFSIZE)
    /* too big, recover */

cmp = strcmp(s1x, s2x);
if (cmp == 0)
    /* equal */
else if (cmp < 0)
    /* s1 < s2 */
else
    /* s1 > s2 */

For one-time comparisons, it is probably faster to use strcoll() directly. But if strings will be compared multiple times, then using strxfrm() once and strcmp() on the transformed values will be faster.

There are no locale-aware collation functions that correspond to strncmp() or strcasecmp().

Low-Level Numeric and Monetary Formatting: localeconv()

Correctly formatting numeric and monetary values requires a fair amount of low-level information. Said information is available in the struct lconv, which is retrieved with the localeconv() function:

#include <locale.h>                                         ISO C

struct lconv *localeconv(void);

Similarly to the ctime() function, this function returns a pointer to internal static data. You should make a copy of the returned data since subsequent calls could return different values if the locale has been changed. Here is the struct lconv (condensed slightly), direct from GLIBC’s <locale.h>:

 struct lconv {
   /* Numeric (non-monetary) information.  */
   char *decimal_point;          /* Decimal point character.  */
   char *thousands_sep;          /* Thousands separator.  */
   /* Each element is the number of digits in each group;
      elements with higher indices are farther left.
      An element with value CHAR_MAX means that no further grouping is done.
      An element with value 0 means that the previous element is used
      for all groups farther left.  */
   char *grouping;

   /* Monetary information.  */
   /* First three chars are a currency symbol from ISO 4217.
      Fourth char is the separator.  Fifth char is '\0'.  */
   char *int_curr_symbol;
   char *currency_symbol;        /* Local currency symbol.  */
   char *mon_decimal_point;      /* Decimal point character.  */
   char *mon_thousands_sep;      /* Thousands separator.  */
   char *mon_grouping;           /* Like 'grouping' element (above).  */
   char *positive_sign;          /* Sign for positive values.  */
   char *negative_sign;          /* Sign for negative values.  */
   char int_frac_digits;         /* Int'l fractional digits.  */
   char frac_digits;             /* Local fractional digits.  */
   /* 1 if currency_symbol precedes a positive value, 0 if succeeds.  */
   char p_cs_precedes;
   /* 1 iff a space separates currency_symbol from a positive value.  */
   char p_sep_by_space;
   /* 1 if currency_symbol precedes a negative value, 0 if succeeds.  */
   char n_cs_precedes;
   /* 1 iff a space separates currency_symbol from a negative value.  */
   char n_sep_by_space;
   /* Positive and negative sign positions:
      0 Parentheses surround the quantity and currency_symbol.
      1 The sign string precedes the quantity and currency_symbol.
      2 The sign string follows the quantity and currency_symbol.
      3 The sign string immediately precedes the currency_symbol.
      4 The sign string immediately follows the currency_symbol.  */
   char p_sign_posn;
   char n_sign_posn;
   /* 1 if int_curr_symbol precedes a positive value, 0 if succeeds.  */
   char int_p_cs_precedes;
   /* 1 iff a space separates int_curr_symbol from a positive value.  */
   char int_p_sep_by_space;
   /* 1 if int_curr_symbol precedes a negative value, 0 if succeeds.  */
   char int_n_cs_precedes;
   /* 1 iff a space separates int_curr_symbol from a negative value.  */
   char int_n_sep_by_space;
   /* Positive and negative sign positions:
      0 Parentheses surround the quantity and int_curr_symbol.
      1 The sign string precedes the quantity and int_curr_symbol.
      2 The sign string follows the quantity and int_curr_symbol.
      3 The sign string immediately precedes the int_curr_symbol.
      4 The sign string immediately follows the int_curr_symbol.  */
  char int_p_sign_posn;
  char int_n_sign_posn;
};

The comments make it fairly clear what’s going on. Let’s look at the first several fields in the struct lconv:

decimal_point

  • The decimal point character to use. In the United States and other English-speaking countries, it’s a period, but many countries use a comma.

thousands_sep

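  • The character that separates groups of digits in a value. The grouping field, described next, determines the size of each group; groups of three digits are common.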

grouping

  • An array of single-byte integer values. Each element indicates how many digits to group. As the comment says, CHAR_MAX means no further grouping should be done, and 0 means reuse the last element. (We show some sample code later in the chapter.)

int_curr_symbol

  • This is the international symbol for the local currency. For example, ’USD’ for U.S. dollars.

currency_symbol

  • This is the local symbol for the local currency. For example, $ for U.S. dollars.

mon_decimal_point, mon_thousands_sep, mon_grouping

  • These correspond to the earlier fields, providing the same information, but for monetary amounts.

Most of the rest of the values are not useful for day-to-day programming. The following program, ch13-lconv.c, prints some of these values, to give you a feel for what kind of information is available:

/* ch13-lconv.c --- show some of the components of the struct lconv */

#include <stdio.h>
#include <limits.h>
#include <locale.h>

int main(void)
{
    struct lconv l;
    int i;

    setlocale(LC_ALL, "");
    l = *localeconv();

    printf("decimal_point = [%s]\n", l.decimal_point);
    printf("thousands_sep = [%s]\n", l.thousands_sep);

    for (i = 0; l.grouping[i] != 0 && l.grouping[i] != CHAR_MAX; i++)
        printf("grouping[%d] = [%d]\n", i, l.grouping[i]);

    printf("int_curr_symbol = [%s]\n", l.int_curr_symbol);
    printf("currency_symbol = [%s]\n", l.currency_symbol);
    printf("mon_decimal_point = [%s]\n", l.mon_decimal_point);
    printf("mon_thousands_sep = [%s]\n", l.mon_thousands_sep);
    printf("positive_sign = [%s]\n", l.positive_sign);
    printf("negative_sign = [%s]\n", l.negative_sign);
}

When run with different locales, not surprisingly we get different results:

$ LC_ALL=en_US ch13-lconv            Results for the United States
decimal_point = [.]
thousands_sep = [,]
grouping[0] = [3]
grouping[1] = [3]
int_curr_symbol = [USD ]
currency_symbol = [$]
mon_decimal_point = [.]
mon_thousands_sep = [,]
positive_sign = []
negative_sign = [-]

$ LC_ALL=it_IT ch13-lconv            Results for Italy
decimal_point = [.]
thousands_sep = []
int_curr_symbol = []
currency_symbol = []
mon_decimal_point = []
mon_thousands_sep = []
positive_sign = []
negative_sign = []

Note how the value for int_curr_symbol in the "en_US" locale includes a trailing space character that acts to separate the symbol from the following monetary value.

High-Level Numeric and Monetary Formatting: strfmon() and printf()

After looking at all the fields in the struct lconv, you may be wondering, “Do I really have to figure out how to use all that information just to format a monetary value?” Fortunately, the answer is no.[3] The strfmon() function does all the work for you:

#include <monetary.h>                                            POSIX

ssize_t strfmon(char *s, size_t max, const char *format, ...);

This routine is much like strftime() (see Section 6.1.3.2, “Complex Time Formatting: strftime(),” page 171), using format to copy literal characters and formatted numeric values into s, placing no more than max characters into it. The following simple program, ch13-strfmon.c, demonstrates how strfmon() works:

/* ch13-strfmon.c --- demonstrate strfmon() */

#include <stdio.h>
#include <locale.h>
#include <monetary.h>

int main(void)
{
    char buf[BUFSIZ];
    double val = 1234.567;

    setlocale(LC_ALL, "");
    strfmon(buf, sizeof buf, "You owe me %n (%i)\n", val, val);

    fputs(buf, stdout);
    return 0;
}

When run in two different locales, it produces this output:

$ LC_ALL=en_US ch13-strfmon        In the United States
You owe me $1,234.57 (USD 1,234.57)
$ LC_ALL=it_IT ch13-strfmon        In Italy
You owe me EUR 1.235 (EUR  1.235)

As you can see, strfmon() is like strftime(), copying regular characters unchanged into the destination buffer and formatting arguments according to its own formatting specifications. There are only three:

%n

Print the national (that is, local) form of the currency value.

%i

Print the international form of the currency value.

%%

Print a literal % character.

The values to be formatted must be of type double. We see the difference between %n and %i in the "en_US" locale: %n uses a $ character, whereas %i uses USD, which stands for “U.S. Dollars.”

Flexibility—and thus a certain amount of complexity—comes along with many of the APIs that were developed for POSIX, and strfmon() is no exception. As with printf(), several optional items that can appear between the % and the i or n provide increased control. The full forms are as follows:

%[flags] [field width] [#left-prec] [.right-prec]i
%[flags] [field width] [#left-prec] [.right-prec]n
%%                                                         No flag, field width, etc., allowed

The flags are listed in Table 13.2.

Table 13.2. Flags for strfmon()

Flag

Meaning

=c

Use the character c for the numeric fill character, for use with the left precision. The default fill character is a space. A common alternative fill character is 0.

^

Disable the use of the grouping character (for example, a comma in the United States).

(

Enclose negative amounts in parentheses. Mutually exclusive with the + flag.

+

Handle positive/negative values normally. Use the locale’s positive and negative signs. Mutually exclusive with the ( flag.

!

Do not include the currency symbol. This flag is useful if you wish to use strfmon() to get more flexible formatting of regular numbers than what sprintf() provides.

-

Left-justify the result. The default is right justification. This flag has no effect without a field width.

The field width is a decimal digit string, providing a minimum width. The default is to use as many characters as necessary based on the rest of the specification. Values smaller than the field width are padded with spaces on the left (or on the right, if the ’-’ flag was given).

The left precision consists of a # character and a decimal digit string. It indicates the minimum number of digits to appear to the left of the decimal point character;[4] if the converted value is smaller than this, the result is padded with the numeric fill character. The default is a space, but the = flag can be used to change it. Grouping characters are not included in the count.

Finally, the right precision consists of a ’.’ character and a decimal digit string. This indicates how many digits to round the value to before it is formatted. The default is provided by the frac_digits and int_frac_digits fields in the struct lconv. If this value is 0, no decimal point character is printed.

strfmon() returns the number of characters placed into the buffer, not including the terminating zero byte. If there’s not enough room, it returns -1 and sets errno to E2BIG.

Besides strfmon(), POSIX (but not ISO C) provides a special flag—the single-quote character, ’—for the printf() formats %i, %d, %u, %f, %F, %g, and %G. In locales that supply a thousands separator, this flag adds the locale’s thousands separator. The following simple program, ch13-quoteflag.c, demonstrates the output:

/* ch13-quoteflag.c --- demonstrate printf's quote flag */

#include <stdio.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_ALL, "");         /* Have to do this, or it won't work */
    printf("%'d\n", 1234567);
    return 0;
}

Here’s what happens for two different locales: one that does not supply a thousands separator and one that does:

$ LC_ALL=C ch13-quoteflag               Traditional environment, no separator
1234567
$ LC_ALL=en_US ch13-quoteflag           English in United States locale, has separator
1,234,567

As of this writing, only GNU/Linux and Solaris support the ' flag. Double-check your system's printf(3) manpage.

Example: Formatting Numeric Values in gawk

gawk implements its own version of the printf() and sprintf() functions. For full locale awareness, gawk must support the ' flag, as in C. The following fragment, from the file builtin.c in gawk 3.1.4, shows how gawk uses the struct lconv for numeric formatting:

 1   case 'd':
 2   case 'i':
 3       ...
 4       tmpval = force_number(arg);
 5
 6       ...
 7       uval = (uintmax_t) tmpval;
 8       ...
 9       ii = jj = 0;
10       do {
11           *--cp = (char) ('0' + uval % 10);
12   #ifdef HAVE_LOCALE_H
13           if (quote_flag && loc.grouping[ii] && ++jj == loc.grouping[ii]) {
14               *--cp = loc.thousands_sep[0]; /* XXX - assumption it's one char */
15               if (loc.grouping[ii+1] == 0)
16                   jj = 0;   /* keep using current val in loc.grouping[ii] */
17               else if (loc.grouping[ii+1] == CHAR_MAX)
18                   quote_flag = FALSE;
19               else {
20                   ii++;
21                   jj = 0;
22               }
23           }
24   #endif
25           uval /= 10;
26       } while (uval > 0);

(The line numbers are relative to the start of the fragment.) Some parts of the code that aren’t relevant to the discussion have been omitted to make it easier to focus on the parts that are important.

The variable loc, used in lines 13–17, is a struct lconv. It’s initialized in main(). Of interest to us here are loc.thousands_sep, which is the thousands-separator character, and loc.grouping, which is an array describing how many digits between separators. A zero element means “use the value in the previous element for all subsequent digits,” and a value of CHAR_MAX means “stop inserting thousands separators.”

With that introduction, let’s look at the code. Line 7 sets uval, which is an unsigned version of the value to be formatted. ii and jj keep track of the position in loc.grouping and the number of digits in the current group that have been converted, respectively.[5] quote_flag is true when a ’ character has been seen in a conversion specification.

The do-while loop generates digit characters in reverse, filling in a buffer from the back end toward the front end. Each digit is generated on line 11. Line 25 then divides by 10, shifting the value right by one decimal digit.

Lines 12–24 are what interest us. The work is done only on a system that supports locales, as indicated by the presence of the <locale.h> header file. The symbolic constant HAVE_LOCALE_H is defined on such a system.[6]

When the condition on line 13 is true, it’s time to add in a thousands-separator character. This condition can be read in English as “if grouping is requested, and the current position in loc.grouping indicates an amount for grouping, and the current count of digits equals the grouping amount.” If this condition is true, line 14 adds the thousands separator character. The comment notes an assumption that is probably true but that might come back to haunt the maintainer at some later time. (The ’XXX’ is a traditional way of marking dangerous or doubtful code. It’s easy to search for and very noticeable to readers of the code.)

Once the current position in loc.grouping has been used, lines 15–22 look ahead at the value in the next position. If it’s 0, then the current position’s value should continue to be used. We specify this by resetting jj to 0 (line 16). On the other hand, if the next position is CHAR_MAX, no more grouping should be done, and line 18 turns it off entirely by setting quote_flag to false. Otherwise, the next value is a grouping value, so line 20 increments ii, and line 21 resets jj to 0.

This is low-level, detailed code. However, once you understand how the information in the struct lconv is presented, the code is straightforward to read (and it was straightforward to write).
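
To experiment with the grouping rules outside of gawk, the following standalone sketch applies the same logic to a grouping array passed in by the caller. The function name format_grouped() and its parameters are our own invention, not part of gawk or the C library. The sketch also checks that digits remain before it inserts a separator, so that a value such as 123 is not given a leading separator.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/*
 * format_grouped --- hypothetical helper (not from gawk): convert val to
 * decimal at the end of buf, inserting sep as directed by a grouping array
 * laid out like the grouping member of struct lconv.  Returns a pointer to
 * the start of the result within buf, or NULL if buf is too small.
 */
char *format_grouped(uintmax_t val, const char *grouping, char sep,
                     char *buf, size_t buflen)
{
    char *cp;
    int ii = 0, jj = 0;     /* position in grouping, digits in this group */
    int grouping_on = 1;

    if (buflen < 2)
        return NULL;
    cp = buf + buflen;
    *--cp = '\0';

    do {
        if (cp == buf)
            return NULL;
        *--cp = (char) ('0' + val % 10);
        val /= 10;
        /* Only add a separator when more digits are still to come. */
        if (val > 0 && grouping_on && grouping[ii] != 0
            && ++jj == grouping[ii]) {
            if (cp == buf)
                return NULL;
            *--cp = sep;
            if (grouping[ii + 1] == 0)
                jj = 0;             /* keep reusing grouping[ii] */
            else if (grouping[ii + 1] == CHAR_MAX)
                grouping_on = 0;    /* no more separators */
            else {
                ii++;
                jj = 0;
            }
        }
    } while (val > 0);

    return cp;
}
```

With a grouping of "\3" and a comma separator, 1234567 comes out as 1,234,567; with "\3\2" it comes out as 12,34,567.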

Formatting Date and Time Values: ctime() and strftime()

Section 6.1, “Times and Dates,” page 166, described the functions for retrieving and formatting time and date values. The strftime() function is also locale-aware if setlocale() has been called appropriately. The following simple program, ch13-times.c, demonstrates this:

/* ch13-times.c --- demonstrate locale-based times */

#include <stdio.h>
#include <locale.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    char buf[100];
    time_t now;
    struct tm *curtime;

    setlocale(LC_ALL, "");
    time(& now);
    curtime = localtime(& now);
    (void) strftime(buf, sizeof buf,
            "It is now %A, %B %d, %Y, %I:%M %p", curtime);

    printf("%s\n", buf);

    printf("ctime() says: %s", ctime(& now));
    exit(0);
}

When the program is run, we see that indeed the strftime() results vary while the ctime() results do not:

$ LC_ALL=en_US ch13-times                         Time in the United States
It is now Friday, July 11, 2003, 10:35 AM
ctime() says: Fri Jul 11 10:35:55 2003

$ LC_ALL=it_IT ch13-times                         Time in Italy
It is now venerdì, luglio 11, 2003, 10:36
ctime() says: Fri Jul 11 10:36:00 2003

$ LC_ALL=fr_FR ch13-times                         Time in France
It is now vendredi, juillet 11, 2003, 10:36
ctime() says: Fri Jul 11 10:36:05 2003

The reason for the lack of variation is that ctime() (and asctime(), upon which ctime() is based) are legacy interfaces; they exist to support old code. strftime(), being a newer interface (developed initially for C89), is free to be locale-aware.
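
A program does not have to enable every locale category to get localized dates. The sketch below, with a helper name of our own choosing (format_long_date()), sets only LC_TIME before calling strftime(). It assumes that the caller supplies a filled-in struct tm and the name of a locale, which may or may not be installed on a given system.

```c
#include <locale.h>
#include <stddef.h>
#include <time.h>

/*
 * format_long_date --- hypothetical helper: format *tm under the named
 * LC_TIME locale.  Returns the strftime() result, or 0 if the locale is
 * not available or the buffer is too small.
 */
size_t format_long_date(const char *locale_name, const struct tm *tm,
                        char *buf, size_t buflen)
{
    /* Change only the time category; other categories are untouched. */
    if (setlocale(LC_TIME, locale_name) == NULL)
        return 0;

    return strftime(buf, buflen, "%A, %B %d, %Y", tm);
}
```

Passing "C" as the locale name yields the English day and month names; passing "" uses whatever the environment requests.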

Other Locale Information: nl_langinfo()

Although we said earlier that the catgets() API is hard to use, one part of that API is generally useful: nl_langinfo(). It provides additional locale-related information, above and beyond that which is available from the struct lconv:

#include <nl_types.h>                     XSI
#include <langinfo.h>

char *nl_langinfo(nl_item item);

The <nl_types.h> header file defines the nl_item type. (This is most likely an int or an enum.) The item parameter is one of the symbolic constants defined in <langinfo.h>. The return value is a string that can be used as needed, either directly or as a format string for strftime().

The available information comes from several locale categories. Table 13.3 lists the item constants, the corresponding locale category, and the item’s meaning.

Table 13.3. Item values for nl_langinfo()

Item name                 Category      Meaning
ABDAY_1, ..., ABDAY_7     LC_TIME       The abbreviated names of the days of the week. Sunday is Day 1.
ABMON_1, ..., ABMON_12    LC_TIME       The abbreviated names of the months.
ALT_DIGITS                LC_TIME       Alternative symbols for digits; see text.
AM_STR, PM_STR            LC_TIME       The a.m./p.m. notations for the locale.
CODESET                   LC_CTYPE      The name of the locale’s codeset; that is, the character set and encoding in use.
CRNCYSTR                  LC_MONETARY   The local currency symbol, described below.
DAY_1, ..., DAY_7         LC_TIME       The names of the days of the week. Sunday is Day 1.
D_FMT                     LC_TIME       The date format.
D_T_FMT                   LC_TIME       The date and time format.
ERA_D_FMT                 LC_TIME       The era date format.
ERA_D_T_FMT               LC_TIME       The era date and time format.
ERA_T_FMT                 LC_TIME       The era time format.
ERA                       LC_TIME       Era description segments; see text.
MON_1, ..., MON_12        LC_TIME       The names of the months.
RADIXCHAR                 LC_NUMERIC    The radix character. For base 10, this is the decimal point character.
THOUSEP                   LC_NUMERIC    The thousands-separator character.
T_FMT_AMPM                LC_TIME       The time format with a.m./p.m. notation.
T_FMT                     LC_TIME       The time format.
YESEXPR, NOEXPR           LC_MESSAGES   Strings representing positive and negative responses.
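
As a small illustration of the interface, here is a sketch of a helper that maps a tm_wday value (0 for Sunday) to the current locale’s day name. The name weekday_name() is ours. The lookup table is there because we don’t want to assume that the DAY_1 through DAY_7 constants have consecutive values.

```c
#include <nl_types.h>
#include <langinfo.h>
#include <stddef.h>

/*
 * weekday_name --- hypothetical helper: return the current locale's name
 * for the day of the week, given a tm_wday-style value (0 is Sunday).
 * Returns NULL if wday is out of range.
 */
const char *weekday_name(int wday)
{
    static const nl_item days[] = {
        DAY_1, DAY_2, DAY_3, DAY_4, DAY_5, DAY_6, DAY_7
    };

    if (wday < 0 || wday > 6)
        return NULL;

    return nl_langinfo(days[wday]);
}
```

The result reflects whatever LC_TIME locale is in effect, so a program that never calls setlocale() sees the "C" locale’s English names.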

An era is a particular time in history. As it relates to dates and times, it makes the most sense in countries ruled by emperors or dynasties.[7]

POSIX era specifications can describe eras before A.D. 1. In such a case, the start date has a higher absolute numeric value than the end date. For example, Alexander the Great ruled from 336 B.C. to 323 B.C.

The value returned by ’nl_langinfo (ERA)’, if not NULL, consists of one or more era specifications. Each specification is separated from the next by a ; character. Components of each era specification are separated from each other by a : character. The components are described in Table 13.4.

Table 13.4. Era specification components

Component     Meaning
Direction     A ’+’ or ’-’ character. A ’+’ indicates that the era runs from a numerically lower year to a numerically higher one, and a ’-’ indicates the opposite.
Offset        The year closest to the start date of the era.
Start date    The date when the era began, in the form yyyy/mm/dd. These are the year, month, and day, respectively. Years before A.D. 1 use a negative value for yyyy.
End date      The date when the era ended, in the same form. Two additional special forms are allowed: -* means the “beginning of time,” and +* means the “end of time.”
Era name      The name of the era, corresponding to strftime()’s %EC conversion specification.
Era format    The format of the year within the era, corresponding to strftime()’s %EY conversion specification.
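
To make the layout concrete, here is a minimal sketch that splits one era specification into the six components of Table 13.4. The struct era_spec type, the parse_era() function, and the fixed field sizes are all assumptions made for illustration; they are not part of any library.

```c
#include <stdio.h>

struct era_spec {           /* hypothetical type for this sketch */
    char direction;         /* '+' or '-' */
    int  offset;            /* year closest to the start date */
    char start[32];         /* start date, yyyy/mm/dd */
    char end[32];           /* end date, or -* or +* */
    char name[64];          /* era name */
    char format[64];        /* era format */
};

/*
 * parse_era --- split the first era specification in spec into its
 * colon-separated components.  Returns 0 on success, -1 on a
 * malformed specification.
 */
int parse_era(const char *spec, struct era_spec *era)
{
    int n;

    /* The last field runs up to the ';' that starts the next era. */
    n = sscanf(spec, "%c:%d:%31[^:]:%31[^:]:%63[^:]:%63[^;]",
               &era->direction, &era->offset, era->start, era->end,
               era->name, era->format);
    if (n != 6)
        return -1;
    if (era->direction != '+' && era->direction != '-')
        return -1;

    return 0;
}
```

Given a made-up specification such as "+:1:1989/01/08:+*:Heisei:%EC %Ey", the sketch fills in a direction of +, an offset of 1, the two dates, and the name and format strings.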

The ALT_DIGITS value also needs some explanation. Some locales provide for “alternative digits.” (Consider Arabic, which uses the decimal numbering system but different glyphs for the digits 0–9. Or consider a hypothetical “Ancient Rome” locale using roman numerals.) These come up, for example, in strftime()’s various %Oc conversion specifications. The return value for ’nl_langinfo(ALT_DIGITS)’ is a semicolon-separated list of character strings for the alternative digits. The first should be used for 0, the next for 1, and so on. POSIX states that up to 100 alternative symbols may be provided. The point is to avoid restricting locales to the use of the ASCII digit characters when a locale has its own numbering system.
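
The following sketch shows one way to pick the symbol for a given value out of such a semicolon-separated list. The function name alt_digit() is hypothetical, and the list is passed in as an ordinary string so that the logic can be tried without a locale that actually defines alternative digits.

```c
#include <stddef.h>
#include <string.h>

/*
 * alt_digit --- hypothetical helper: copy the n'th (counting from 0)
 * entry of a semicolon-separated list into buf.  Returns 0 on success,
 * -1 if the list has no such entry or buf is too small.
 */
int alt_digit(const char *list, unsigned n, char *buf, size_t buflen)
{
    const char *start = list;
    const char *end;
    size_t len;

    /* Skip over n separators to reach the wanted entry. */
    while (n-- > 0) {
        start = strchr(start, ';');
        if (start == NULL)
            return -1;
        start++;
    }

    end = strchr(start, ';');
    len = (end != NULL) ? (size_t) (end - start) : strlen(start);
    if (len == 0 || len >= buflen)
        return -1;

    memcpy(buf, start, len);
    buf[len] = '\0';
    return 0;
}
```

For the “Ancient Rome” case, a list such as "nulla;I;II;III" would give II for the value 2.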

Finally, ’nl_langinfo(CRNCYSTR)’ returns the local currency symbol. The first character of the return value, if it’s a ’-’, ’+’, or ’.’, indicates how the symbol should be used:

-    The symbol should appear before the value.
+    The symbol should appear after the value.
.    The symbol should replace the radix character (decimal point).
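
Here is a sketch of how a program might act on that leading character. The helper format_money() is hypothetical; it takes the CRNCYSTR-style string as a parameter and hard-codes ’.’ as the radix character to keep the example short. Real code would more likely let strfmon() do the whole job.

```c
#include <stdio.h>

/*
 * format_money --- hypothetical helper: place the currency symbol from a
 * CRNCYSTR-style string (such as "-$" or "+kr") around an amount given
 * as whole units and hundredths.  Returns the snprintf() result.
 */
int format_money(const char *crncystr, long whole, int cents,
                 char *buf, size_t buflen)
{
    switch (crncystr[0]) {
    case '-':   /* symbol precedes the value */
        return snprintf(buf, buflen, "%s%ld.%02d",
                        crncystr + 1, whole, cents);
    case '+':   /* symbol follows the value */
        return snprintf(buf, buflen, "%ld.%02d%s",
                        whole, cents, crncystr + 1);
    case '.':   /* symbol replaces the radix character */
        return snprintf(buf, buflen, "%ld%s%02d",
                        whole, crncystr + 1, cents);
    default:    /* no placement hint: print the bare number */
        return snprintf(buf, buflen, "%ld.%02d", whole, cents);
    }
}
```

With "-$" the amount 12.50 is formatted as $12.50, with "+kr" as 12.50kr, and with ".$" as 12$50.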

Dynamic Translation of Program Messages

The standard C library interfaces just covered solve the easy parts of the localization problem. Monetary, numeric, and time and date values, as well as string collation issues, all lend themselves to management through tables of locale-specific data (such as lists of month and day names).

However, most user interaction with a text-based program occurs in the form of the messages it outputs, such as prompts or error messages. The problem is to avoid having multiple versions of the same program that differ only in the contents of the message strings. The de facto solution in the GNU world is GNU gettext. (GUI programs face similar issues with the items in menus and menu bars; typically, each major user interface toolkit has its own way to solve that problem.)

GNU gettext enables translation of program messages into different languages at runtime. Within the code for a program, this translation involves several steps, each of which uses different library functions. Once the program itself has been properly prepared, several shell-level utilities facilitate the preparation of translations into different languages. Each such translation is referred to as a message catalog.

Setting the Text Domain: textdomain()

A complete application may contain multiple components: individual executables written in C or C++ or in scripting languages that can also access gettext facilities, such as gawk or the Bash shell. The components of the application all share the same text domain, which is a string that uniquely identifies the application. (Examples might be "gawk" or "coreutils"; the former is a single program, and the latter is a whole suite of programs.) The text domain is set with textdomain():

#include <libintl.h>                                       GLIBC

char *textdomain(const char *domainname);

Each component should call this function with a string naming the text domain as part of the initial startup activity in main(). The return value is the current text domain. If the domainname argument is NULL, then the current domain is returned; otherwise, it is set to the new value and that value is then returned. A return value of NULL indicates an error of some sort.

If the text domain is not set with textdomain(), the default domain is "messages".
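
A startup routine along the following lines is one way to package these calls. The name init_i18n() is ours, and the error handling is deliberately minimal.

```c
#include <libintl.h>
#include <locale.h>
#include <stdio.h>

/*
 * init_i18n --- hypothetical startup helper, meant to be called early
 * in main().  Returns 0 on success, -1 if the domain can't be set.
 */
int init_i18n(const char *domain)
{
    setlocale(LC_ALL, "");      /* honor the user's environment */

    if (textdomain(domain) == NULL) {
        fprintf(stderr, "cannot set text domain %s\n", domain);
        return -1;
    }

    return 0;
}
```

Afterwards, ’textdomain(NULL)’ reports the domain that was just set, which can be handy in debugging output.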

Translating Messages: gettext()

The next step after setting the text domain is to use the gettext() function (or a variant) for every string that should be translated. Several functions provide translation services:

#include <libintl.h>                                            GLIBC

char *gettext(const char *msgid);
char *dgettext(const char *domainname, const char *msgid);
char *dcgettext(const char *domainname, const char *msgid, int category);

The arguments used in these functions are as follows:

const char *msgid

  • The string to be translated. It acts as a key into a database of translations.

const char *domainname

  • The text domain from which to retrieve the translation. Thus, even though main() has called textdomain() to set the application’s own domain, messages can be retrieved from other text domains. (This is most applicable to messages that might be in the text domain for a third-party library, for example.)

int category

  • One of the locale categories described earlier (LC_TIME, etc.).

The default text domain is whatever was set with textdomain() ("messages" if textdomain() was never called). The default category is LC_MESSAGES. Assume that main() makes the following call:

textdomain("killerapp");

Then, ’gettext("my message")’ is equivalent to ’dgettext("killerapp", "my message")’. Both of these, in turn, are equivalent to ’dcgettext("killerapp", "my message", LC_MESSAGES)’.

You will want to use gettext() 99.9 percent of the time. However, the other functions give you the flexibility to work with other text domains or locale categories. You are most likely to need this flexibility when doing library programming, since a standalone library will almost certainly be in its own text domain.
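
As a sketch of the library case, the following function looks up its messages in the library’s own text domain with dgettext(), leaving the application’s domain alone. The domain name "libkiller" and the function lib_strerror() are made up for this example.

```c
#include <libintl.h>

#define LIB_DOMAIN "libkiller"  /* hypothetical text domain for a library */

/*
 * lib_strerror --- hypothetical library routine: return a (possibly
 * translated) description of one of the library's error codes.
 */
const char *lib_strerror(int code)
{
    switch (code) {
    case 0:
        return dgettext(LIB_DOMAIN, "no error");
    case 1:
        return dgettext(LIB_DOMAIN, "out of memory");
    default:
        return dgettext(LIB_DOMAIN, "unknown error");
    }
}
```

Because the domain is named explicitly in each call, the library never needs to call textdomain(), and so it cannot disturb the setting made by the application’s main().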

All the functions return a string. The string is either the translation of the given msgid or, if no translation exists, the original string. Thus, there is always some output, even if it’s just the original (presumably English) message. For example:

/* The canonical first program, localized version. */

#include <stdio.h>
#include <locale.h>
#include <libintl.h>

int main(void)
{
    setlocale(LC_ALL, "");
    printf("%s\n", gettext("hello, world"));
    return 0;
}

Although the message is a simple string, we don’t use it directly as the printf() control string, since in general, translations can contain % characters.

Shortly, in Section 13.3.4, “Making gettext() Easy to Use,” page 510, we’ll see how to make gettext() easier to use in large-scale, real-world programs.

Working with Plurals: ngettext()

Translating plurals provides special difficulties. Naive code might look like this:

printf("%d word%s misspelled\n", nwords, nwords > 1 ? "s" : "");
/* or */
printf("%d %s misspelled\n", nwords, nwords == 1 ? "word" : "words");

This is reasonable for English, but translation becomes difficult. First of all, many languages don’t use as simple a plural form as English (adding an s suffix for most words). Second, many languages, particularly in Eastern Europe, have multiple plural forms, each indicating how many objects the form designates. Thus, even code like this isn’t enough:

if (nwords == 1)
    printf("one word misspelled\n");
else
    printf("%d words misspelled\n", nwords);

The solution is a parallel set of routines specifically for formatting plural values:

#include <libintl.h>                                         GLIBC

char *ngettext(const char *msgid, const char *msgid_plural,
               unsigned long int n);
char *dngettext(const char *domainname, const char *msgid,
                const char *msgid_plural, unsigned long int n);
char *dcngettext(const char *domainname, const char *msgid,
                 const char *msgid_plural, unsigned long int n, int category);

Besides the original msgid argument, these functions accept additional arguments:

const char *msgid_plural

  • The default string to use for plural values. Examples shortly.

unsigned long int n

  • The number of items there are.

Each locale's message catalog specifies how to translate plurals.[8] The ngettext() function (and its variants) examines n and, based on the specification in the message catalog, returns the appropriate translation of msgid. If the catalog does not have a translation for msgid, or in the "C" locale, ngettext() returns msgid if ’n == 1’; otherwise, it returns msgid_plural. Thus, our misspelled words example looks like this:

printf(ngettext("%d word misspelled\n", "%d words misspelled\n", nwords),
       nwords);

Note that nwords must be passed to ngettext() to select a format string, and then to printf() for formatting. In addition, be careful not to use a macro or expression whose value changes each time, like ’n++’! Such a thing could happen if you’re doing global editing to add calls to ngettext() and you don’t pay attention.
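
One way to avoid evaluating the count twice by accident is to wrap the two uses in a small function, as in this sketch. The name format_misspelled() is hypothetical. The count is taken as an unsigned long to match ngettext()’s parameter, so the format strings use %lu.

```c
#include <libintl.h>
#include <stdio.h>

/*
 * format_misspelled --- hypothetical helper: build the "words misspelled"
 * message for nwords in buf.  nwords is used twice, once to choose the
 * format and once to fill it in.  Returns the snprintf() result.
 */
int format_misspelled(char *buf, size_t buflen, unsigned long nwords)
{
    return snprintf(buf, buflen,
                    ngettext("%lu word misspelled",
                             "%lu words misspelled", nwords),
                    nwords);
}
```

With no translation available, a count of 1 selects the first string and every other count, including 0, selects the second.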

Making gettext() Easy to Use

The call to gettext() in program source code serves two purposes. First, it does the translation at runtime, which is the main point, after all. However, it also serves to mark the strings that need translating. The xgettext utility reads program source code and extracts all the original strings that need translation. (We briefly cover the mechanics of this later in the chapter.)

Consider the case, though, of static strings that aren’t used directly:

static char *copyrights[] = {
    "Copyright 2004, Jane Programmer",
    "Permission is granted ...",
    ...                                    LOTS of legalese here
    NULL
};
void copyright(void)
{
    int i;
    for (i = 0; copyrights[i] != NULL; i++)
        printf("%s\n", gettext(copyrights[i]));
}

Here, we’d like to be able to print the translations of the copyright strings if they’re available. However, how is the xgettext extractor supposed to find these strings? We can’t enclose them in calls to gettext() because that won’t work at compile time:

/* BAD CODE: won't compile */
static char *copyrights[] = {
    gettext("Copyright 2004, Jane Programmer"),
    gettext("Permission is granted ..."),
    ...                                   LOTS of legalese here
    NULL
};

Portable Programs: "gettext.h"

We assume here that you wish to write a program that can be used along with the GNU gettext library on any Unix system, not just GNU/Linux systems. The next section describes what to do for GNU/Linux-only programs.

The solution to marking strings involves two steps. The first is the use of the gettext.h convenience header that comes in the GNU gettext distribution. This file handles several portability and compilation issues, making it easier to use gettext() in your own programs:

#define ENABLE_NLS 1         ENABLE_NLS must be true for gettext() to work
#include "gettext.h"         Instead of <libintl.h>

If the ENABLE_NLS macro is not defined[9] or it’s set to zero, then gettext.h expands calls to gettext() into the first argument. This makes it possible to port code using gettext() to systems that have neither GNU gettext installed nor their own version. Among other things, this header file defines the following macro:

/* A pseudo function call that serves as a marker for the automated
   extraction of messages, but does not call gettext().  The run-time
   translation is done at a different place in the code.
   The argument, String, should be a literal string.  Concatenated strings
   and other string expressions won't work.
   The macro's expansion is not parenthesized, so that it is suitable as
   initializer for static 'char[]' or 'const char[]' variables.  */
#define gettext_noop(String) String

The comment is self-explanatory. With this macro, we can now proceed to the second step. We rewrite the code as follows:

#define ENABLE_NLS 1
#include "gettext.h"

static char copyrights[] =
    gettext_noop("Copyright 2004, Jane Programmer\n"
    "Permission is granted ...\n"
    ...                                   LOTS of legalese here
    "So there.");

void copyright(void)
{
    printf("%s\n", gettext(copyrights));
}

Note that we made two changes. First, copyrights is now one long string, built up by using the Standard C string constant concatenation feature. This single string is then enclosed in the call to gettext_noop(). We need a single string so that the legalese can be translated as a single entity.

The second change is to print the translation directly, as one string in copyright().

By now, you may be thinking, “Gee, having to type ’gettext(...)’ each time is pretty painful.” Well, you’re right. Not only is it extra work to type, it makes program source code harder to read as well. Thus, once you are using the gettext.h header file, the GNU gettext manual recommends the introduction of two more macros, named _() and N_(), as follows:

#define ENABLE_NLS 1
#include "gettext.h"
#define _(msgid) gettext(msgid)
#define N_(msgid) msgid

This approach reduces the burden of using gettext() to just three extra characters per translatable string constant and only four extra characters for static strings:

#include <stdio.h>
#include <stdlib.h>
#define ENABLE_NLS 1
#include "gettext.h"
#define _(msgid) gettext(msgid)
#define N_(msgid) msgid
...
static char copyrights[] =
    N_("Copyright 2004, Jane Programmer\n"
    "Permission is granted ...\n"
    ...                                    LOTS of legalese here
    "So there.");

void copyright(void)
{
    printf("%s\n", gettext(copyrights));
}

int main(void)
{
    setlocale(LC_ALL, "");       /* gettext.h gets <locale.h> for us too */
    printf("%s\n", _("hello, world"));
    copyright();
    exit(0);
}

These macros are unobtrusive, and in practice, all GNU programs that use GNU gettext use this convention. If you intend to use GNU gettext, you too should follow this convention.

GLIBC Only: <libintl.h>

For a program that will only be used on systems with GLIBC, the header file usage and macros are similar, but simpler:

#include <stdio.h>
#include <libintl.h>
#define _(msgid) gettext(msgid)
#define N_(msgid) msgid
...everything else is the same ...

As we saw earlier, the <libintl.h> header file declares gettext() and the other functions. You still have to define _() and N_(), but you don’t have to worry about ENABLE_NLS, or distributing gettext.h with your program’s source code.
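
When the static strings are short, independent items rather than one block of legalese, the array form can be kept: mark each element with N_() and translate at the point of use with _(). The sketch below uses the GLIBC-only <libintl.h> arrangement to stay self-contained; the table of color names and the function color_name() are invented for the example.

```c
#include <libintl.h>
#include <stdio.h>

#define _(msgid) gettext(msgid)
#define N_(msgid) msgid

/* Marked for extraction with N_(), but not translated here. */
static const char *const color_names[] = {
    N_("red"),
    N_("green"),
    N_("blue"),
};

/*
 * color_name --- hypothetical lookup: return the (possibly translated)
 * name for color number i.
 */
const char *color_name(size_t i)
{
    if (i >= sizeof color_names / sizeof color_names[0])
        return _("unknown");

    return _(color_names[i]);   /* translation happens at runtime */
}
```

Each array element becomes its own msgid, so translators see the three names as separate entries.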

Rearranging Word Order with printf()

When translations are produced, sometimes the word order that is natural in English is incorrect for other languages. For instance, while in English an adjective appears before the noun it modifies, in many languages it appears after the noun. Thus, code like the following presents a problem:

char *animal_color, *animal;

if (...) {
    animal_color = _("brown");
    animal = _("cat");
} else if (...) {
    ...
} else {
    ...
}
printf(_("The %s %s looks at you enquiringly.\n"), animal_color, animal);

Here, the format string, animal_color and animal are all properly enclosed in calls to gettext(). However, the statement will still be incorrect when translated, since the order of the arguments cannot be changed at runtime.

To get around this, the POSIX (but not ISO C) version of the printf() family allows you to provide a positional specifier within a format specifier. This takes the form of a decimal number followed by a $ character immediately after the initial % character. For example:

printf("%2$s, %1$s\n", "world", "hello");

The positional specifier indicates which argument in the argument list to use; counts begin at 1 and don’t include the format string itself. This example prints the famous ’hello, world’ message in the correct order.

GLIBC and Solaris implement this capability. As it’s part of POSIX, if your Unix vendor’s printf() doesn’t have it, it should be appearing soon.

Any of the regular printf() flags, a field width, and a precision may follow the positional specifier. The rules for using positional specifiers are these:

  • The positional specifier form may not be mixed with the nonpositional form. In other words, either every format specifier includes a positional specifier or none of them do. Of course, %% can always be used.

  • If the N’th argument is used in the format string, all the arguments up to N must also be used by the string. Thus, the following is invalid:

    printf("%3$s %1$s\n", "hello", "cruel", "world");
    
  • A particular argument may be referenced with a positional specifier multiple times. Nonpositional format specifications always move through the argument list sequentially.

This facility isn’t intended for direct use by application programmers, but rather by translators. For example, a French translation for the previous format string, "The %s %s looks at you enquiringly.\n", might be:

"Le %2$s %1$s te regarde d'un air interrogateur.\n"

(Even this translation isn’t perfect: the article “Le” is gender specific. Preparing a program for translation is a hard job!)
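
The effect is easy to try without any message catalog by passing the format in as a parameter, which stands in for the string that gettext() would return. The function describe_animal() below is a hypothetical wrapper, and the example assumes a printf() family that supports positional specifiers.

```c
#include <stdio.h>

/*
 * describe_animal --- hypothetical wrapper: format a color and an animal
 * with the given format string.  In a real program, fmt would be the
 * result of a gettext() call.  The argument order is fixed here; only
 * the format decides the order in which the arguments are printed.
 */
int describe_animal(char *buf, size_t buflen, const char *fmt,
                    const char *color, const char *animal)
{
    return snprintf(buf, buflen, fmt, color, animal);
}
```

Called with "The %s %s looks at you." the color comes first; called with "Le %2$s %1$s te regarde." and the same argument order, the animal comes first.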

Testing Translations in a Private Directory

The collection of messages in a program is referred to as the message catalog. This term also applies to each translation of the messages into a different language. When a program is installed, each translation is also installed in a standard location, where gettext() can find the right one at runtime.

It can be useful to place translations in a directory other than the standard one, particularly for program testing. Especially on larger systems, a regular developer probably does not have the permissions necessary to install files in system directories. The bindtextdomain() function gives gettext() an alternative place to look for translations:

#include <libintl.h>                                             GLIBC

char *bindtextdomain(const char *domainname, const char *dirname);

Useful directories include ’.’ for the current directory and /tmp. It might also be handy to get the directory from an environment variable, like so:

char *td_dir;

setlocale(LC_ALL, "");
textdomain("killerapp");
if ((td_dir = getenv("KILLERAPP_TD_DIR")) != NULL)
    bindtextdomain("killerapp", td_dir);

bindtextdomain() should be called before any calls to the gettext() family of functions. We see an example of how to use it in Section 13.3.8, “Creating Translations,” page 517.
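
The environment-variable approach can be wrapped in a helper that also reports which directory ends up in effect. This sketch relies on the GLIBC behavior that a NULL dirname makes bindtextdomain() return the current binding without changing it; the function name bind_test_catalogs() is our own.

```c
#include <libintl.h>
#include <stdlib.h>

/*
 * bind_test_catalogs --- hypothetical helper: if the environment variable
 * named by envvar is set and nonempty, bind domain to that directory.
 * Either way, return the directory now in effect for domain (NULL on
 * error).
 */
const char *bind_test_catalogs(const char *domain, const char *envvar)
{
    const char *dir = getenv(envvar);

    if (dir != NULL && *dir != '\0')
        return bindtextdomain(domain, dir);

    return bindtextdomain(domain, NULL);    /* query only */
}
```

Printing the returned directory at startup, perhaps under a debugging option, makes it obvious which set of translations a test run is using.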

Preparing Internationalized Programs

So far, we’ve looked at all the components that go into an internationalized program. This section summarizes the process.

  1. Adopt the gettext.h header file into your application, and add definitions for the _() and N_() macros to a header file that is included by all your C source files. Don’t forget to define the ENABLE_NLS symbolic constant.

  2. Call setlocale() as appropriate. It is easiest to call ’setlocale(LC_ALL, "")’, but occasionally an application may need to be more picky about which locale categories it enables.

  3. Pick a text domain for the application, and set it with textdomain().

  4. If testing, bind the text domain to a particular directory with bindtextdomain().

  5. Use strfmon(), strftime(), and the ' flag for printf() as appropriate. If other locale information is needed, use nl_langinfo(), particularly in conjunction with strftime().

  6. Mark all strings that should be translated with calls to _() or N_(), as appropriate.

    A few should not be so marked though. For example, if you use getopt_long() (see Section 2.1.2, “GNU Long Options,” page 27), you probably don’t want the long option names to be marked for translation. Also, simple format strings like "%d %d\n" don’t need to be translated, nor do debugging messages.

  7. When appropriate, use ngettext() (or its variants) for dealing with values that can be either 1 or greater than 1.

  8. Make life easier for your translators by using multiple strings representing complete sentences instead of doing word substitutions with %s and ?:. For example:

if (an error occurred) {        /* RIGHT */
    /* Use multiple strings to make translation easier. */
    if (input_type == INPUT_FILE)
        fprintf(stderr, _("%s: cannot read file: %s\n"),
                         argv[0], strerror(errno));
    else
        fprintf(stderr, _("%s: cannot read pipe: %s\n"),
                         argv[0], strerror(errno));
}

This is better than

if (an error occurred) {        /* WRONG */
    fprintf(stderr, _("%s: cannot read %s: %s\n"), argv[0],
            input_type == INPUT_FILE ? _("file") : _("pipe"),
            strerror(errno));
}

As just shown, it’s a good idea to include a comment stating that there are multiple messages on purpose—to make it easier to translate the messages.

Creating Translations

Once your program has been internationalized, it’s necessary to prepare translations. This is done with several shell-level tools. We start with an internationalized version of ch06-echodate.c, from Section 6.1.4, “Converting a Broken-Down Time to a time_t,” page 176:

/* ch13-echodate.c --- demonstrate translations */

#include <stdio.h>
#include <time.h>
#include <locale.h>
#define ENABLE_NLS 1
#include "gettext.h"
#define _(msgid) gettext(msgid)
#define N_(msgid) msgid

int main(void)
{
    struct tm tm;
    time_t then;

    setlocale(LC_ALL, "");
    bindtextdomain("echodate", ".");
    textdomain("echodate");

    printf("%s", _("Enter a Date/time as YYYY/MM/DD HH:MM:SS : "));
    scanf("%d/%d/%d %d:%d:%d",
        & tm.tm_year, & tm.tm_mon, & tm.tm_mday,
        & tm.tm_hour, & tm.tm_min, & tm.tm_sec);

    /* Error checking on values omitted for brevity. */
    tm.tm_year -= 1900;
    tm.tm_mon -= 1;

    tm.tm_isdst = -1;   /* Don't know about DST */

    then = mktime(& tm);

    printf(_("Got: %s"), ctime(& then));
    return 0;
}

We have purposely used "gettext.h" and not <gettext.h>. If our application ships with a private copy of the gettext library, then "gettext.h" will find it, avoiding the system’s copy. On the other hand, if there is no local copy, the system’s copy will be found instead. The situation is admittedly complicated by the fact that Solaris systems also have a gettext library which is not as featureful as the GNU version.

Moving on to creating translations, the first step is to extract the translatable strings. This is done with the xgettext program:

$ xgettext --keyword=_ --keyword=N_ \
> --default-domain=echodate ch13-echodate.c

The --keyword options tell xgettext to look for the _() and N_() macros. It already knows to extract strings from gettext() and its variants, as well as from gettext_noop().

The output from xgettext is called a portable object file. The default filename is messages.po, corresponding to the default text domain of "messages". The --default-domain option indicates the text domain, for use in naming the output file. In this case, the file is named echodate.po. Here are its contents:

# SOME DESCRIPTIVE TITLE.                               Boilerplate, to be edited
# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, fuzzy
msgid ""                                                Detailed information
msgstr ""                                               Each translator completes
"Project-Id-Version: PACKAGE VERSION\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2003-07-14 18:46-0700\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: 8bit\n"

#: ch13-echodate.c:19                                   Message location
msgid "Enter a Date/time as YYYY/MM/DD HH:MM:SS : "     Original message
msgstr ""                                               Translation goes here

#: ch13-echodate.c:32                                   Same for each message
#, c-format
msgid "Got: %s"
msgstr ""

This original file is reused for each translation. It is thus a template for translations, and by convention it should be renamed to reflect this fact, with a .pot (portable object template) suffix:

$ mv echodate.po echodate.pot

Given that we aren’t fluent in many languages, we have chosen to translate the messages into pig Latin. Thus, the next step is to produce a translation. We do this by copying the template file and adding translations to the new copy:

$ cp echodate.pot piglat.po
$ vi piglat.po                   Add translations, use your favorite editor

The filename convention is language.po where language is the two- or three-character international standard abbreviation for the language. Occasionally the form language_country.po is used: for example, pt_BR.po for Portuguese in Brazil. As pig Latin isn’t a real language, we’ve called the file piglat.po. Here are the contents, after the translations have been added:

# echodate translations into pig Latin
# Copyright (C) 2004 Prentice-Hall
# This file is distributed under the same license as the echodate package.
# Arnold Robbins <[email protected]> 2004
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: echodate 1.0\n"
"Report-Msgid-Bugs-To: [email protected]\n"
"POT-Creation-Date: 2003-07-14 18:46-0700\n"
"PO-Revision-Date: 2003-07-14 19:00+8\n"
"Last-Translator: Arnold Robbins <[email protected]>\n"
"Language-Team: Pig Latin <[email protected]>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=ASCII\n"
"Content-Transfer-Encoding: 8bit\n"

#: ch13-echodate.c:19
msgid "Enter a Date/time as YYYY/MM/DD HH:MM:SS : "
msgstr "Enteray A Ateday/imetay asay YYYY/MM/DD HH:MM:SS : "

#: ch13-echodate.c:32
#, c-format
msgid "Got: %s"
msgstr "Otgay: %s"

While it would be possible to do a linear search directly in the portable object file, such a search would be noticeably slow for a file with hundreds of messages. For example, gawk has approximately 350 separate messages, and the GNU Coreutils have over 670. Therefore, GNU gettext uses a binary format for fast message lookup. msgfmt does the compilation, producing a message object file:

$ msgfmt piglat.po -o piglat.mo
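
To illustrate why a sorted, precompiled table is faster than scanning the text file, here is a minimal sketch that looks up a message with bsearch(). It is not the actual message object file layout; struct msg, catalog, and lookup_msg() are hypothetical names invented for this illustration, and the table is hard-coded rather than read from a file.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory catalog entry; NOT the real .mo file layout. */
struct msg {
    const char *msgid;   /* original string, used as the lookup key */
    const char *msgstr;  /* translation */
};

/* The table must be sorted by msgid (strcmp() order) for bsearch(). */
static const struct msg catalog[] = {
    { "Enter a Date/time as YYYY/MM/DD HH:MM:SS : ",
      "Enteray A Ateday/imetay asay YYYY/MM/DD HH:MM:SS : " },
    { "Got: %s", "Otgay: %s" },
};

/* bsearch() passes the key first and the table element second. */
static int cmp_msg(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const struct msg *)elem)->msgid);
}

/* Return the translation, or the original string if none is found
   (gettext() falls back to the original string in the same way). */
const char *lookup_msg(const char *msgid)
{
    const struct msg *m = bsearch(msgid, catalog,
                                  sizeof(catalog) / sizeof(catalog[0]),
                                  sizeof(catalog[0]), cmp_msg);

    return m != NULL ? m->msgstr : msgid;
}
```

With a sorted table, each lookup needs a number of comparisons proportional to the logarithm of the table size instead of to the table size itself, which is the kind of saving a compiled catalog is after.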

As program maintenance is done, the strings used by a program change: new strings are added, others are deleted or changed. At the very least, a string’s location in the source file may move around. Thus, translation .po files will likely get out of date. The msgmerge program merges an old translation file with a new .pot file. The result can then be updated. This example does a merge and then recompiles:

$ msgmerge piglat.po echodate.pot -o piglat.new.po   Merge files
$ mv piglat.new.po piglat.po                         Rename the result
$ vi piglat.po                                       Bring translations up to date
$ msgfmt piglat.po -o piglat.mo                      Recreate .mo file

Compiled .mo files are placed in the file base/locale/category/textdomain.mo. On GNU/Linux systems, base is /usr/share/locale. locale is the language, such as ’es’, ’fr’, and so on. category is a locale category; for messages, it is LC_MESSAGES. textdomain is the text domain of the program: in our case, echodate. As a real example, the GNU Coreutils Spanish translation is in /usr/share/locale/es/LC_MESSAGES/coreutils.mo.
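
The naming scheme can be sketched as a small helper that assembles the four components into a pathname. The function name mo_path() is hypothetical; it only shows how the pieces fit together and does not reproduce the fallbacks that the real gettext library may apply when searching for a catalog.

```c
#include <stdio.h>

/* Build base/locale/category/textdomain.mo into buf.
   Returns 0 on success, -1 if buf is too small or formatting fails. */
int mo_path(char *buf, size_t size, const char *base, const char *locale,
            const char *category, const char *textdomain)
{
    int n = snprintf(buf, size, "%s/%s/%s/%s.mo",
                     base, locale, category, textdomain);

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Called with "/usr/share/locale", "es", "LC_MESSAGES", and "coreutils", the helper produces the Coreutils pathname given above.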

The bindtextdomain() function changes the base part of the location. In ch13-echodate.c, we change it to ’.’. Thus, it’s necessary to make the appropriate directories, and place the pig Latin translation there:

$ mkdir -p en_US/LC_MESSAGES                      Have to use a real locale
$ cp piglat.mo en_US/LC_MESSAGES/echodate.mo      Put the file in the right place

A real locale must be used;[10] thus, we “pretend” by using "en_US". With the translation in place, we set LC_ALL appropriately, cross our fingers, and run the program:

$ LC_ALL=en_US ch13-echodate                      Run the program
Enteray A Ateday/imetay asay YYYY/MM/DD HH:MM:SS : 2003/07/14 21:19:26
Otgay: Mon Jul 14 21:19:26 2003

The latest version of GNU gettext can be found in the GNU gettext distribution directory.[11]

This section has necessarily only skimmed the surface of the localization process. GNU gettext provides many tools for working with translations, and in particular for making it easy to keep translations up to date as program source code evolves.

The manual process for updating translations is workable but tedious. This task is easily automated with make; in particular GNU gettext integrates well with Autoconf and Automake to provide this functionality, removing considerable development burden from the programmer.

We recommend reading the GNU gettext documentation to learn more about both of these issues in particular and about GNU gettext in general.

Can You Spell That for Me, Please?

In the very early days of computing, different systems assigned different correspondences between numeric values and glyphs—symbols such as letters, digits, and punctuation used for communication with humans. Eventually, two widely used standards emerged: the EBCDIC encoding used on IBM and workalike mainframes, and ASCII, used on everything else. Today, except on mainframes, ASCII is the basis for all other character sets currently in use.

The original seven-bit ASCII character set suffices for American English and most punctuation and special characters (such as $, but there is no character for the “cent” symbol). However, there are many languages and many countries with different character set needs. ASCII doesn’t handle the accented versions of the roman characters used in Europe, and many Asian languages have thousands of characters. New technologies have evolved to solve these deficiencies.

The i18n literature abounds with references to three fundamental terms. Once we define them and their relationship to each other, we can present a general description of the corresponding C APIs.

Character set

  • A definition of the meaning assigned to different integer values; for example, that A is 65. Any character set that uses more than eight bits per character is termed a multibyte character set.

Character set encoding

  • ASCII uses a single byte to represent characters. Thus, the integer value is stored as itself, directly in disk files. More recent character sets, most notably different versions of Unicode,[12] use 16-bit or even 32-bit integer values for representing characters. For most of the defined characters, one, two, or even three of the higher bytes in the integer are zero, making direct storage of their values in disk files expensive. The encoding describes a mechanism for converting 16- or 32-bit values into one to six bytes for storage on disk, such that overall there is a significant space savings.

Language

  • The rules for a given language dictate character set usage. In particular, the rules affect the ordering of characters. For example, in French, e, é, and è should all come between d and f, no matter what numerical values are assigned to those characters. Different languages can (and do) assign different orderings to the same glyphs.
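
As an illustration of a character set encoding, the following sketch converts one Unicode value into UTF-8, the most widely used Unicode encoding. The function name utf8_encode() is our own invention. The sketch follows the current definition of UTF-8, which stops at four bytes and at the value 0x10FFFF; older definitions allowed sequences of up to six bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out, which must have room
   for at least 4 bytes. Returns the number of bytes written, or 0 if cp
   is not a valid code point. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                    /* ASCII is stored as itself */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {            /* 11 bits: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {          /* 16 bits: three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {        /* 21 bits: four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* beyond the Unicode range */
}
```

Note that the value 65 for A still occupies a single byte, so plain ASCII text costs nothing extra, whereas storing every character as a 32-bit integer would quadruple its size.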

Various technologies have evolved over time for supporting multibyte character sets. Computing practice is slowly converging on Unicode and its encoding, but Standard C and POSIX support both past and present techniques. This section provides a conceptual overview of the various facilities. We have not had to use them ourselves, so we prefer to merely introduce them and provide pointers to more information.

Wide Characters

We start with the concept of a wide character. A wide character is an integer type that can hold any value of the particular multibyte character set being used.

Wide characters are represented in C with the type wchar_t. C99 provides a corresponding wint_t type, which can hold any value that a wchar_t can hold, and the special value WEOF, which is analogous to regular EOF from <stdio.h>. The various types are defined in the <wchar.h> header file. A number of functions similar to those of <ctype.h> are defined by the <wctype.h> header file, such as iswalnum(), and many more.

Wide characters may be 16 to 32 bits in size, depending on the implementation. As mentioned, they’re intended for manipulating data in memory and are not usually stored directly in files.

For wide characters, the C standard provides a large number of functions and macros that correspond to the traditional functions that work on char data. For example, wprintf(), iswlower(), and so on. These are documented in the GNU/Linux manpages and in books on Standard C.
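
As a small, hedged illustration of how the wide-character functions mirror their char counterparts, the following sketch uses iswalpha() and towupper() from <wctype.h>. The helper names count_walpha() and wupcase() are our own, not part of any library.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Count the alphabetic wide characters in a wide string. */
size_t count_walpha(const wchar_t *ws)
{
    size_t n = 0;

    for (; *ws != L'\0'; ws++)
        if (iswalpha((wint_t)*ws))
            n++;
    return n;
}

/* Convert a wide string to uppercase, in place. */
void wupcase(wchar_t *ws)
{
    for (; *ws != L'\0'; ws++)
        *ws = (wchar_t)towupper((wint_t)*ws);
}
```

Which characters count as alphabetic, and what their uppercase forms are, depends on the current locale; a program that never calls setlocale() gets the behavior of the "C" locale.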

Multibyte Character Encodings

Strings of wide characters are stored on disk by being converted to a multibyte character set encoding in memory, and the converted data is then written to a disk file. Similarly, such strings are read in from disk through low-level block I/O, and converted in memory from the encoded version to the wide-character version.

Many defined encodings represent multibyte characters by using shift states. In other words, given an input byte stream, byte values represent themselves until a special control value is encountered. At that point, the interpretation changes according to the current shift state. Thus, the same eight-bit value can have two meanings: one for the normal, unshifted state, and another for the shifted state. Correctly encoded strings are supposed to start and end in the same shift state.

A significant advantage of Unicode is that its encodings are self-correcting; the encodings don’t use shift states, so a loss of data in the middle does not corrupt the subsequent encoded data.

The initial versions of the multibyte-to-wide-character and wide-character-to-multibyte functions maintained a private copy of the state of the translation (for example, the shift state, and anything else that might be necessary). This design limits the functions’ use to one kind of translation throughout the life of the program. Examples are mblen() (multibyte-character length), mbtowc() (multibyte to wide character), wctomb() (wide character to multibyte), mbstowcs() (multibyte string to wide-character string), and wcstombs() (wide-character string to multibyte string).

The newer versions of these routines are termed restartable. This means that the user-level code maintains the state of the translation in a separate object, of type mbstate_t. The corresponding examples are mbrlen(), mbrtowc(), wcrtomb(), mbsrtowcs(), and wcsrtombs(). (Note the r, for “restartable,” in their names.)
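
The following sketch shows the restartable style: the caller owns the mbstate_t object and hands it to mbrlen() on every call. The function name mb_count_chars() is hypothetical, and the sketch assumes that the string is a complete, NUL-terminated multibyte string in the encoding of the current locale.

```c
#include <string.h>
#include <wchar.h>

/* Count characters (not bytes) in a multibyte string, using the
   restartable mbrlen() with a caller-owned conversion state.
   Returns (size_t)-1 on an invalid or incomplete sequence. */
size_t mb_count_chars(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&state, 0, sizeof state);    /* start in the initial state */
    while (remaining > 0) {
        size_t len = mbrlen(s, remaining, &state);

        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;          /* invalid or truncated input */
        if (len == 0)
            break;  /* embedded null; strlen() rules it out, so defensive */
        s += len;
        remaining -= len;
        count++;
    }
    return count;
}
```

In the "C" locale every byte is one character, so the result equals strlen(). After a call to setlocale() that selects a multibyte encoding, the count can be smaller than the byte length.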

Languages

Language issues are controlled by the locale. We’ve already seen setlocale() earlier in the chapter. POSIX provides an elaborate mechanism for defining the rules by which a locale works; see the GNU/Linux locale(5) manpage for some of the details and the POSIX standard itself for the full story.

The truth is, you really don’t want to know the details. Nor should you, as an application developer, need to worry about them; it is up to the library implementors to make things work. All you need to do is understand the concepts and make your code use the appropriate functions, such as strcoll() (see Section 13.2.3, “String Collation: strcoll() and strxfrm(),” page 490).

Current GLIBC systems provide excellent locale support, including a multibyte-aware suite of regular expression matching routines. For example, the POSIX extended regular expression [[:alpha:]][[:alnum:]]+ matches a letter followed by one or more letters or digits (an alphabetic character followed by one or more alphanumeric ones). The definition of which characters match these classes depends on the locale. For example, in a suitable locale this regular expression would match a string of two non-ASCII letters, whereas the traditional Unix, ASCII-oriented regular expression [a-zA-Z][a-zA-Z0-9]+ most likely would not. The POSIX character classes are listed in Table 13.5.

Table 13.5. POSIX regular expression character classes

Class name    Matches
[:alnum:]     Alphanumeric characters.
[:alpha:]     Alphabetic characters.
[:blank:]     Space and TAB characters.
[:cntrl:]     Control characters.
[:digit:]     Numeric characters.
[:graph:]     Characters that are both printable and visible. (A space is printable but not visible, whereas a $ is both.)
[:lower:]     Lowercase alphabetic characters.
[:print:]     Printable characters (not control characters).
[:punct:]     Punctuation characters (not letters, digits, control characters, or space characters).
[:space:]     Space characters (such as space, TAB, newline, and so on).
[:upper:]     Uppercase alphabetic characters.
[:xdigit:]    Characters from the set abcdefABCDEF0123456789.
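
The character classes can be tried out from C with the POSIX regcomp() and regexec() functions. The following is a minimal sketch; the function name is_letter_then_alnum() is hypothetical, and the ^ and $ anchors are our addition so that the whole string, not just some part of it, has to match.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if s consists entirely of a letter followed by one or more
   letters or digits, 0 if it does not, and -1 if the pattern cannot
   be compiled. */
int is_letter_then_alnum(const char *s)
{
    regex_t re;
    int rc;

    if (regcomp(&re, "^[[:alpha:]][[:alnum:]]+$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);   /* 0 means the string matched */
    regfree(&re);
    return rc == 0;
}
```

For the classes to follow the user's locale instead of the "C" locale, the program should call setlocale() before compiling the pattern.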

Conclusion

You may never have to deal with different character sets and encodings. On the other hand, the world is rapidly becoming a “global village,” and software authors and vendors can’t afford to be parochial. It pays, therefore, to be aware of internationalization issues and character set issues and the way in which they affect your system’s behavior. Already, at least one vendor of GNU/Linux distributions sets the default locale to be en_US.UTF-8 for systems in the United States.

Suggested Reading

  1. C, A Reference Manual, 5th edition, by Samuel P. Harbison III and Guy L. Steele, Jr., Prentice-Hall, Upper Saddle River, New Jersey, USA, 2002. ISBN: 0-13-089592-X.

    We have mentioned this book before. It provides a concise and comprehensible description of the evolution and use of the multibyte and wide-character facilities in the C standard library. This is particularly valuable on modern systems supporting C99 because the library was significantly enhanced for the 1999 C standard.

  2. GNU gettext tools, by Ulrich Drepper, Jim Meyering, François Pinard, and Bruno Haible. This is the manual for GNU gettext. On a GNU/Linux system, you can see the local copy with ’info gettext’. Or download and print the latest version (from ftp://ftp.gnu.org/gnu/gettext/).

Summary

  • Program internationalization and localization fall under the general heading of native language support. i18n, l10n, and NLS are popular acronyms. The central concept is the locale, which customizes the character set, date, time, and monetary and numeric information for the current language and country.

  • Locale awareness must be enabled with setlocale(). Different locale categories provide access to the different kinds of locale information. Locale-unaware programs act as if they were in the "C" locale, which produces results typical of Unix systems before NLS: 7-bit ASCII, English names for months and days, and so on. The "POSIX" locale is equivalent to the "C" one.

  • Locale-aware string comparisons are done with strcoll() or with the combination of strxfrm() and strcmp(). Library facilities provide access to locale information (localeconv() and nl_langinfo()) as well as locale-specific information formatting (strfmon(), strftime(), and printf()).

  • The flip side of retrieving locale-related information is producing messages in the local language. The System V catgets() design, while standardized by POSIX, is difficult to use and not recommended.[13] Instead, GNU gettext implements and extends the original Solaris design.

  • With gettext(), the original English message string acts as a key into a binary translation file from which to retrieve the string’s translation. Each application specifies a unique text domain so that gettext() can find the correct translation file (known as a “message catalog”). The text domain is set with textdomain(). For testing, or as otherwise needed, the location for message catalogs can be changed with bindtextdomain().

  • Along with gettext(), variants provide access to translations in different text domains or different locale categories. Additionally, the ngettext() function and its variants enable correct plural translations without overburdening the developer. The positional specifier within printf() format specifiers enables translation of format strings where arguments need to be printed in a different order from the one in which they’re provided.

  • In practice, GNU programs use the gettext.h header file and _() and N_() macros for marking translatable strings in their source files. This practice keeps program source code readable and maintainable while still providing the benefits of i18n and l10n.

  • GNU gettext provides numerous tools for the creation and management of translation databases (portable object files) and their binary equivalents (message object files).

  • Finally, it pays to be aware of character set and encoding issues. Software vendors can no longer afford to assume that their users are willing to work in only one language.

Exercises

  1. Does your system support locales? If so, what is the default locale?

  2. Look at the locale(1) manpage if you have it. How many locales are there if you count them with ’locale -a | wc -l’?

  3. Experiment with ch13-strings.c, ch13-lconv.c, ch13-strfmon.c, ch13-quoteflag.c, and ch13-times.c in different locales. What is the most “unusual” locale you can find, and why?

  4. Take one of your programs. Internationalize it to use GNU gettext. Try to find someone who speaks another language to translate the messages for you. Compile the translation, and test it by using bindtextdomain(). What was your translator’s reaction upon seeing the translations in use?



[1] An earlier design, known as catgets(), exists. Although this design is standardized by POSIX, it is much harder to use, and we don’t recommend it.

[2] Long-time C and Unix programmers may prefer to use the “C” locale, even if they are native English speakers; the English locales produce different results from what grizzled, battle-scarred Unix veterans expect.

[3] We’re as happy as you are, since we don’t have to provide example code that uses this, er, full-featured struct.

[4] The technical term used in the standards is radix point, since numbers in different bases may have fractional parts as well. However, for monetary values, it seems pretty safe to use the term “decimal point.”

[5] We probably should have chosen more descriptive names than just ii and jj. Since the code that uses them is short, our lack of imagination is not a significant problem.

[6] This is set by the Autoconf and Automake machinery. Autoconf and Automake are powerful software suites that make it possible to support a wide range of Unix systems in a systematic fashion.

[7] Although Americans often refer to the eras of particular presidents, these are not a formal part of the national calendar in the same sense as in pre-World War II Japan or pre-Communist China.

[8] The details are given in the GNU gettext documentation. Here, we’re focusing on the developer’s needs, not the translator’s.

[9] This macro is usually automatically defined by the configure program, either in a special header or on the compiler command line. configure is generated with Autoconf and Automake.

[10] We spent a frustrating 30 or 45 minutes attempting to use a piglat/LC_MESSAGES directory and setting ’LC_ALL=piglat’, all to no effect, until we figured this out.

[13] GNU/Linux supports it, but only for compatibility.
