Once upon a time, computer makers wrote software that worked only in English. Eventually, hardware and software vendors noticed that if their systems worked in the native languages of non-English-speaking countries, they were able to sell more systems. As a result, internationalization and localization of programs and software systems became a common practice.
For many years, the ability to provide internationalization was largely restricted to programs written in C and C++. This chapter describes the underlying library gawk uses for internationalization, as well as how gawk makes internationalization features available at the awk program level. Having internationalization available at the awk level gives software developers additional flexibility: they are no longer forced to write in C or C++ when internationalization is a requirement.
Internationalization means writing (or modifying) a program once, in such a way that it can use multiple languages without requiring further source code changes. Localization means providing the data necessary for an internationalized program to work in a particular language. Most typically, these terms refer to features such as the language used for printing error messages, the language used to read responses, and information related to how numerical and monetary values are printed and read.
gawk uses GNU gettext to provide its internationalization features. The facilities in GNU gettext focus on messages: strings printed by a program, either directly or via formatting with printf or sprintf().[82]
When using GNU gettext, each application has its own text domain. This is a unique name, such as ‘kpilot’ or ‘gawk’, that identifies the application. A complete application may have multiple components: programs written in C or C++, as well as scripts written in sh or awk. All of the components use the same text domain.
To make the discussion concrete, assume we’re writing an application named guide. Internationalization consists of the following steps, in this order:
The programmer reviews the source for all of guide’s components and marks each string that is a candidate for translation. For example, "`-F': option required" is a good candidate for translation. A table with strings of option names is not (e.g., gawk’s --profile option should remain the same, no matter what the local language).
The programmer indicates the application’s text domain ("guide") to the gettext library, by calling the textdomain() function.
Messages from the application are extracted from the source code and collected into a portable object template file (guide.pot), which lists the strings and their translations. The translations are initially empty. The original (usually English) messages serve as the key for lookup of the translations.
For each language with a translator, guide.pot is copied to a portable object file (.po) and translations are created and shipped with the application. For example, there might be a fr.po for a French translation.
Each language’s .po file is converted into a binary message object (.gmo) file. A message object file contains the original messages and their translations in a binary format that allows fast lookup of translations at runtime.
When guide is built and installed, the binary translation files are installed in a standard place.
For testing and development, it is possible to tell gettext to use .gmo files in a different directory than the standard one by using the bindtextdomain() function.
At runtime, guide looks up each string via a call to gettext(). The returned string is the translated string if available, or the original string if not.
If necessary, it is possible to access messages from a different text domain than the one belonging to the application, without having to switch the application’s default text domain back and forth.
In C (or C++), the string marking and dynamic translation lookup are accomplished by wrapping each string in a call to gettext():
printf("%s", gettext("Don't Panic!\n"));
The tools that extract messages from source code pull out all strings enclosed in calls to gettext().
The GNU gettext developers, recognizing that typing ‘gettext(…)’ over and over again is both painful and ugly to look at, use the macro ‘_’ (an underscore) to make things easier:
/* In the standard header file: */
#define _(str) gettext(str)

/* In the program text: */
printf("%s", _("Don't Panic!\n"));
This reduces the typing overhead to just three extra characters per string and is considerably easier to read as well.
There are locale categories for different types of locale-related information. The defined locale categories that gettext knows about are:
LC_MESSAGES
Text messages. This is the default category for gettext operations, but it is possible to supply a different one explicitly, if necessary. (It is almost never necessary to supply a different category.)
LC_COLLATE
Text-collation information (i.e., how different characters and/or groups of characters sort in a given language).
LC_CTYPE
Character-type information (alphabetic, digit, upper- or lowercase, and so on) as well as character encoding. This information is accessed via the POSIX character classes in regular expressions, such as /[[:alnum:]]/ (see Using Bracket Expressions).
LC_MONETARY
Monetary information, such as the currency symbol, and whether the symbol goes before or after a number.
LC_NUMERIC
Numeric information, such as which characters to use for the decimal point and the thousands separator.[83]
LC_TIME
Time- and date-related information, such as 12- or 24-hour clock, month printed before or after the day in a date, local month abbreviations, and so on.
LC_ALL
All of the above. (Not too useful in the context of gettext.)
gawk provides the following variables for internationalization:
TEXTDOMAIN
This variable indicates the application’s text domain. For compatibility with GNU gettext, the default value is "messages".
_"your message here"
String constants marked with a leading underscore are candidates for translation at runtime. String constants without a leading underscore are not translated.
gawk provides the following functions for internationalization:
dcgettext(string [, domain [, category]])
Return the translation of string in text domain domain for locale category category. The default value for domain is the current value of TEXTDOMAIN. The default value for category is "LC_MESSAGES".
If you supply a value for category, it must be a string equal to one of the known locale categories described in the previous section. You must also supply a text domain. Use TEXTDOMAIN if you want to use the current domain.
The order of arguments to the awk version of the dcgettext() function is purposely different from the order for the C version. The awk version’s order was chosen to be simple and to allow for reasonable awk-style default arguments.
dcngettext(string1, string2, number [, domain [, category]])
Return the plural form used for number of the translation of string1 and string2 in text domain domain for locale category category. string1 is the English singular variant of a message, and string2 is the English plural variant of the same message. The default value for domain is the current value of TEXTDOMAIN. The default value for category is "LC_MESSAGES".
The same remarks about argument order as for the dcgettext() function apply.
bindtextdomain(directory [, domain])
Change the directory in which gettext looks for .gmo files, in case they will not or cannot be placed in the standard locations (e.g., during testing). Return the directory in which domain is “bound.”
The default domain is the value of TEXTDOMAIN. If directory is the null string (""), then bindtextdomain() returns the current binding for the given domain.
To use these facilities in your awk program, follow these steps:
Set the variable TEXTDOMAIN to the text domain of your program. This is best done in a BEGIN rule (see The BEGIN and END Special Patterns), or it can also be done via the -v command-line option (see Command-Line Options):
BEGIN {
    TEXTDOMAIN = "guide"
    …
}
Mark all translatable strings with a leading underscore (‘_’) character. It must be adjacent to the opening quote of the string. For example:
print _"hello, world"
x = _"you goofed"
printf(_"Number of users is %d\n", nusers)
If you are creating strings dynamically, you can still translate them, using the dcgettext() built-in function:[84]
if (groggy)
    message = dcgettext("%d customers disturbing me\n", "adminprog")
else
    message = dcgettext("enjoying %d customers\n", "adminprog")
printf(message, ncustomers)
Here, the call to dcgettext() supplies a different text domain ("adminprog") in which to find the message, but it uses the default "LC_MESSAGES" category.
The previous example only works if ncustomers is greater than one. This example would be better done with dcngettext(), passing the count as the third argument so that the correct plural form is chosen:
if (groggy)
    message = dcngettext("%d customer disturbing me\n",
                         "%d customers disturbing me\n",
                         ncustomers, "adminprog")
else
    message = dcngettext("enjoying %d customer\n",
                         "enjoying %d customers\n",
                         ncustomers, "adminprog")
printf(message, ncustomers)
During development, you might want to put the .gmo file in a private directory for testing. This is done with the bindtextdomain() built-in function:
BEGIN {
    TEXTDOMAIN = "guide"   # our text domain
    if (Testing) {
        # where to find our files
        bindtextdomain("testdir")
        # joe is in charge of adminprog
        bindtextdomain("../joe/testdir", "adminprog")
    }
    …
}
See A Simple Internationalization Example for an example program showing the steps to create and use translations from awk.
Once a program’s translatable strings have been marked, they must be extracted to create the initial .pot file. As part of translation, it is often helpful to rearrange the order in which arguments to printf are output.
gawk’s --gen-pot command-line option extracts the messages and is discussed next. After that, printf’s ability to rearrange the order for printf arguments at runtime is covered.
Once your awk program is working, and all the strings have been marked and you’ve set (and perhaps bound) the text domain, it is time to produce translations. First, use the --gen-pot command-line option to create the initial .pot file:
gawk --gen-pot -f guide.awk > guide.pot
When run with --gen-pot, gawk does not execute your program. Instead, it parses it as usual and prints all marked strings to standard output in the format of a GNU gettext Portable Object file. Also included in the output are any constant strings that appear as the first argument to dcgettext() or as the first and second argument to dcngettext().[85] You should distribute the generated .pot file with your awk program; translators will eventually use it to provide you translations that you can also then distribute. See A Simple Internationalization Example for the full list of steps to go through to create and test translations for guide.
Format strings for printf and sprintf() (see Using printf Statements for Fancier Printing) present a special problem for translation. Consider the following:[86]
printf(_"String `%s' has %d characters\n", string, length(string))
A possible German translation for this might be:
"%d Zeichen lang ist die Zeichenkette `%s'\n"
The problem should be obvious: the order of the format specifications is different from the original! Even though gettext() can return the translated string at runtime, it cannot change the argument order in the call to printf.
To solve this problem, printf format specifiers may have an additional optional element, which we call a positional specifier. For example:
"%2$d Zeichen lang ist die Zeichenkette `%1$s'\n"
Here, the positional specifier consists of an integer count, which indicates which argument to use, and a ‘$’. Counts are one-based, and the format string itself is not included. Thus, in the following example, ‘string’ is the first argument and ‘length(string)’ is the second:
$ gawk 'BEGIN {
>     string = "Don\47t Panic"
>     printf "%2$d characters live in \"%1$s\"\n",
>             string, length(string)
> }'
11 characters live in "Don't Panic"
If present, positional specifiers come first in the format specification, before the flags, the field width, and/or the precision.
Positional specifiers can be used with the dynamic field width and precision capability:
$ gawk 'BEGIN {
>     printf("%*.*s\n", 10, 20, "hello")
>     printf("%3$*2$.*1$s\n", 20, 10, "hello")
> }'
     hello
     hello
When using ‘*’ with a positional specifier, the ‘*’ comes first, then the integer position, and then the ‘$’. This is somewhat counterintuitive.
gawk does not allow you to mix regular format specifiers and those with positional specifiers in the same string:
$ gawk 'BEGIN { printf "%d %3$s\n", 1, 2, "hi" }'
error→ gawk: cmd. line:1: fatal: must use `count$' on all formats or none
There are some pathological cases that gawk may fail to diagnose. In such cases, the output may not be what you expect. It’s still a bad idea to try mixing them, even if gawk doesn’t detect it.
Although positional specifiers can be used directly in awk programs, their primary purpose is to help in producing correct translations of format strings into languages different from the one in which the program is first written.
gawk’s internationalization features were purposely chosen to have as little impact as possible on the portability of awk programs that use them to other versions of awk. Consider this program:
BEGIN {
    TEXTDOMAIN = "guide"
    if (Test_Guide)   # set with -v
        bindtextdomain("/test/guide/messages")
    print _"don't panic!"
}
As written, it won’t work on other versions of awk. However, it is actually almost portable, requiring very little change:
Assignments to TEXTDOMAIN won’t have any effect, because TEXTDOMAIN is not special in other awk implementations.
Non-GNU versions of awk treat marked strings as the concatenation of a variable named _ with the string following it.[87] Typically, the variable _ has the null string ("") as its value, leaving the original string constant as the result.
By defining “dummy” functions to replace dcgettext(), dcngettext(), and bindtextdomain(), the awk program can be made to run, but all the messages are output in the original language. For example:
function bindtextdomain(dir, domain)
{
    return dir
}

function dcgettext(string, domain, category)
{
    return string
}

function dcngettext(string1, string2, number, domain, category)
{
    return (number == 1 ? string1 : string2)
}
The use of positional specifications in printf or sprintf() is not portable. To support gettext() at the C level, many systems’ C versions of sprintf() do support positional specifiers. But it works only if enough arguments are supplied in the function call. Many versions of awk pass printf formats and arguments unchanged to the underlying C library version of sprintf(), but only one format and argument at a time. What happens if a positional specification is used is anybody’s guess. However, because the positional specifications are primarily for use in translated format strings, and because non-GNU awks never retrieve the translated string, this should not be a problem in practice.
Now let’s look at a step-by-step example of how to internationalize and localize a simple awk program, using guide.awk as our original source:
BEGIN {
    TEXTDOMAIN = "guide"
    bindtextdomain(".")  # for testing
    print _"Don't Panic"
    print _"The Answer Is", 42
    print "Pardon me, Zaphod who?"
}
Run ‘gawk --gen-pot’ to create the .pot file:
$ gawk --gen-pot -f guide.awk > guide.pot
This produces:
#: guide.awk:4
msgid "Don't Panic"
msgstr ""

#: guide.awk:5
msgid "The Answer Is"
msgstr ""
This original portable object template file is saved and reused for each language into which the application is translated. The msgid is the original string and the msgstr is the translation.
Strings not marked with a leading underscore do not appear in the guide.pot file.
Next, the messages must be translated. Here is a translation to a hypothetical dialect of English, called “Mellow”:[88]
$ cp guide.pot guide-mellow.po
Add translations to guide-mellow.po …
Following are the translations:
#: guide.awk:4
msgid "Don't Panic"
msgstr "Hey man, relax!"

#: guide.awk:5
msgid "The Answer Is"
msgstr "Like, the scoop is"
The next step is to make the directory to hold the binary message object file and then to create the guide.mo file. We pretend that our file is to be used in the en_US.UTF-8 locale, because we have to use a locale name known to the C gettext routines. The directory layout shown here is standard for GNU gettext on GNU/Linux systems. Other versions of gettext may use a different layout:
$ mkdir en_US.UTF-8 en_US.UTF-8/LC_MESSAGES
The msgfmt utility does the conversion from human-readable .po file to machine-readable .mo file. By default, msgfmt creates a file named messages.mo. The output must instead be named guide.mo and placed in the proper directory (using the -o option) so that gawk can find it:
$ msgfmt guide-mellow.po -o en_US.UTF-8/LC_MESSAGES/guide.mo
Finally, we run the program to test it:
$ gawk -f guide.awk
Hey man, relax!
Like, the scoop is 42
Pardon me, Zaphod who?
If the three replacement functions for dcgettext(), dcngettext(), and bindtextdomain() (see awk Portability Issues) are in a file named libintl.awk, then we can run guide.awk unchanged as follows:
$ gawk --posix -f guide.awk -f libintl.awk
Don't Panic
The Answer Is 42
Pardon me, Zaphod who?
gawk itself has been internationalized using the GNU gettext package. (GNU gettext is described in complete detail in GNU gettext utilities.) As of this writing, the latest version of GNU gettext is version 0.19.4.
If a translation of gawk’s messages exists, then gawk produces usage messages, warnings, and fatal errors in the local language.
Internationalization means writing a program such that it can use multiple languages without requiring source code changes. Localization means providing the data necessary for an internationalized program to work in a particular language.
gawk uses GNU gettext to let you internationalize and localize awk programs. A program’s text domain identifies the program for grouping all messages and other data together.
You mark a program’s strings for translation by preceding them with an underscore. Once that is done, the strings are extracted into a .pot file. This file is copied for each language into a .po file, and the .po files are compiled into .gmo files for use at runtime.
You can use positional specifications with sprintf() and printf to rearrange the placement of argument values in formatted strings and output. This is useful for the translation of format control strings.
The internationalization features have been designed so that they can be easily worked around in a standard awk.
gawk itself has been internationalized and ships with a number of translations for its messages.
[82] For some operating systems, the gawk port doesn’t support GNU gettext. Therefore, these features are not available if you are using one of those operating systems. Sorry.
[83] Americans use a comma every three decimal places and a period for the decimal point, while many Europeans do exactly the opposite: 1,234.56 versus 1.234,56.
[84] Thanks to Bruno Haible for this example.
[85] The xgettext utility that comes with GNU gettext can handle .awk files.
[86] This example is borrowed from the GNU gettext manual.
[87] This is good fodder for an “Obfuscated awk” contest.
[88] Perhaps it would be better if it were called “Hippy.” Ah, well.