Chapter 15. Internationalization

This chapter describes features that support text processing for different character sets such as ASCII and Japanese. Tcl can read and write data in various character set encodings, but it processes data in a standard character set called Unicode. Tcl has a message catalog that lets you generate different versions of an application for different languages. Tcl commands described are: encoding and msgcat.

Different languages use different alphabets, or character sets. An encoding is a standard way to represent a character set. Tcl hides most of the issues associated with encodings and character sets, but you need to be aware of them when you write applications that are used in different countries. You can also write an application using a message catalog so that the strings you display to users can be in the language of their choice. Using a message catalog is more work, but Tcl makes it as easy as possible.

Most of the hard work in dealing with character set encodings is done “under the covers” by the Tcl C library. The Tcl C library underwent substantial changes to support international character sets. Instead of using 8-bit bytes to store characters, Tcl uses a 16-bit character set called Unicode, which is large enough to encode the alphabets of all languages. There is also plenty of room left over to represent special characters like ⊗.

In spite of all the changes to support Unicode, there are few changes visible to the Tcl script writer. Scripts written for Tcl 8.0 and earlier continue to work fine with Tcl 8.1 and later versions. You only need to modify scripts if you want to take advantage of the features added to support internationalization.

This chapter begins with a discussion of what a character set is and why different encodings are used to represent them. It concludes with a discussion of message catalogs.

Character Sets and Encodings

If you are from the United States, you've probably never thought twice about character sets. Most computers use the ASCII encoding, which has 128 characters. That is enough for the 26 letters in the English alphabet, upper case and lower case, plus numbers, various punctuation characters, and control characters like tab and newline. ASCII fits easily in 8-bit characters, which can represent 256 different values.

European alphabets include accented characters like è, ñ, and ä. The ISO Latin-1 encoding is a superset of ASCII that encodes 256 characters. It shares the ASCII encoding in values 0 through 127 and uses the “high half” of the encoding space to represent accented characters as well as special characters like ©. There are several ISO Latin encodings to handle different alphabets, and these share the trick of encoding ASCII in the lower half and other characters in the high half. You might see these encodings referred to as iso8859-1, iso8859-2, and so on.
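The shared lower half can be checked from a script. The following is a minimal sketch that relies on the encoding convertto command, described later in this chapter; HexBytes is a hypothetical helper name introduced only for this illustration.

```tcl
# HexBytes is a hypothetical helper (not part of Tcl): it returns the
# bytes that represent a string in a given encoding, as hex digits.
proc HexBytes {encoding string} {
    binary scan [encoding convertto $encoding $string] H* hex
    return $hex
}

puts [HexBytes ascii A]           ;# 41
puts [HexBytes iso8859-1 A]       ;# 41, the same value as in ASCII
puts [HexBytes iso8859-1 \u00e8]  ;# e8, an accented e from the high half
```

The \u00e8 escape is used instead of a literal è so that the example does not depend on the encoding of the script file itself.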

Asian character sets are simply too large to fit into 8-bit encodings. There are a number of 16-bit encodings for these languages. If you work with these, you are probably familiar with the “Big 5” or ShiftJIS encodings.

Unicode is an international standard character set encoding. There are both 16-bit Unicode and 32-bit Unicode standards, but Tcl and just about everyone else use the 16-bit standard. Unicode has the important property that it can encode all the important character sets without conflicts and overlap. By converting all characters to the Unicode encoding, Tcl can work with different character sets simultaneously. As of 8.4, Tcl is compliant with Unicode v3.1. For more information on Unicode, see http://www.unicode.org/

The System Encoding

Computer systems are set up with a standard system encoding for their files. If you always work with this encoding, then you can ignore character set issues. Tcl will read files and automatically convert them from the system encoding to Unicode. When Tcl writes files, it automatically converts from Unicode to the system encoding. If you are curious, you can find out the system encoding with:

encoding system
=> cp1252

The “cp” is short for “code page,” the term that Windows uses to refer to different encodings. On my Unix system, the system encoding is iso8859-1.

Note

Do not change the system encoding.

You could also change the system encoding with:

encoding system encoding

But this is not a good idea. It immediately changes how Tcl passes strings to your operating system, and it is likely to leave Tcl in an unusable state. Tcl automatically determines the system encoding for you. Don't bother trying to set it yourself.

The encoding names command lists all the encodings that Tcl knows about. The encodings are kept in files stored in the encoding directory under the Tcl script library. They are loaded automatically the first time you use an encoding.

lsort [encoding names]
=> ascii big5 cp1250 cp1251 cp1252 cp1253 cp1254 cp1255 cp1256 cp1257 cp1258 cp437 cp737
cp775 cp850 cp852 cp855 cp857 cp860 cp861 cp862 cp863 cp864 cp865 cp866 cp869 cp874 cp932
cp936 cp949 cp950 dingbats euc-cn euc-jp euc-kr gb12345 gb1988 gb2312 identity iso2022
iso2022-jp iso2022-kr iso8859-1 iso8859-2 iso8859-3 iso8859-4 iso8859-5 iso8859-6
iso8859-7 iso8859-8 iso8859-9 jis0201 jis0208 jis0212 ksc5601 macCentEuro macCroatian
macCyrillic macDingbats macGreek macIceland macJapan macRoman macRomania macThai
macTurkish macUkraine shiftjis symbol unicode utf-8

The encoding names reflect their origin. The "cp" refers to the "code pages" that Windows uses to manage encodings. The "mac" encodings come from the Macintosh. The "iso," "euc," "gb," and "jis" encodings come from various standards bodies.
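Because the set of encodings varies between installations, a script may want to check for a name before using it. The following is a small sketch; EncodingKnown is a hypothetical helper name, not a Tcl command.

```tcl
# EncodingKnown is a hypothetical helper: it returns 1 if the given
# name appears in the list reported by [encoding names], otherwise 0.
proc EncodingKnown {name} {
    return [expr {[lsearch -exact [encoding names] $name] >= 0}]
}

puts [EncodingKnown utf-8]            ;# 1
puts [EncodingKnown no-such-encoding] ;# 0
```

The comparison is exact, so the name must be spelled the way Tcl spells it, including case (for example, macRoman).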

File Encodings and fconfigure

The conversion to Unicode happens automatically in the Tcl C library. When Tcl reads and writes files, it translates between the current system encoding and Unicode. If you have files in different encodings, you can use the fconfigure command to set the encoding. For example, to read a Russian file in the ISO Cyrillic encoding (iso8859-5):

set in [open README.russian]
fconfigure $in -encoding iso8859-5
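A round trip through a file shows what the -encoding setting does. This sketch assumes Tcl 8.6 or later for file tempfile, and it uses iso8859-5, the ISO Cyrillic encoding, as the file encoding. The raw bytes are read back with -translation binary, which suppresses character set conversion.

```tcl
# Six Cyrillic letters, written as escapes so that the script file
# itself can stay pure ASCII.
set text "\u041f\u0440\u0438\u0432\u0435\u0442"

# Write the text in iso8859-5. [file tempfile] (Tcl 8.6+) returns an
# open channel and stores the file name in the variable "path".
set out [file tempfile path]
fconfigure $out -encoding iso8859-5
puts -nonewline $out $text
close $out

# Read it back under the same encoding: the characters are restored.
set in [open $path]
fconfigure $in -encoding iso8859-5
set decoded [read $in]
close $in

# Read it again without conversion to count the bytes on disk.
set in [open $path]
fconfigure $in -translation binary
set raw [read $in]
close $in
file delete $path

puts [string equal $decoded $text]  ;# 1
puts [string length $raw]           ;# 6, one byte per character
```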

Example 15-1 shows a simple utility I use in exmh,[*] a MIME-aware mail reader. MIME has its own convention for specifying the character set encoding of a mail message that differs slightly from Tcl's naming convention. The procedure launders the name and then sets the encoding. Exmh was already aware of MIME character sets, so it could choose fonts for message display. Adding this procedure and adding two calls to it was all I had to do to adapt exmh to Unicode.

Example 15-1. MIME character sets and file encodings

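proc Mime_SetEncoding {file charset} {
   # MIME charset names are case insensitive; Tcl spells these in lower case.
   set charset [string tolower $charset]
   # Drop the hyphen after the prefix: iso-8859-1 becomes iso8859-1.
   regsub -all {(iso|jis|us)-} $charset {\1} charset
   # Tcl calls the US-ASCII encoding "ascii".
   regsub usascii $charset ascii charset
   fconfigure $file -encoding $charset
}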

Scripts in Different Encodings

If you have scripts that are not in the system encoding, then you cannot use source to load them. However, it is easy to read the files yourself under the proper encoding and use eval to process them. Example 15-2 adds a -encoding flag to the source command. This is likely to become a built-in feature in future versions of Tcl so that commands like info script will work properly:

Example 15-2. Using scripts in nonstandard encodings

proc Source {args} {
   set file [lindex $args end]
   if {[llength $args] == 3 &&
          [string equal -encoding [lindex $args 0]]} {
      set encoding [lindex $args 1]
      set in [open $file]
      fconfigure $in -encoding $encoding
      set script [read $in]
      close $in
      return [uplevel 1 $script]
   } elseif {[llength $args] == 1} {
      return [uplevel 1 [list source $file]]
   } else {
      return -code error \
         "Usage: Source ?-encoding encoding? file"
   }
}

Unicode and UTF-8

UTF-8 is an encoding for Unicode. While Unicode represents all characters with 16 bits, the UTF-8 encoding uses either 8, 16, or 24 bits to represent one Unicode character. This variable-width encoding is useful because it uses 8 bits to represent ASCII characters. This means that a pure ASCII string, one with character codes all less than 128, is also a UTF-8 string. Tcl uses UTF-8 internally to make the transition to Unicode easier. It allows interoperability with Tcl extensions that have not been made Unicode-aware. They can continue to pass ASCII strings to Tcl, and Tcl will interpret them correctly.
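The variable width is easy to observe. The following sketch counts the bytes that UTF-8 uses for three sample characters; Utf8Length is a hypothetical helper name, and encoding convertto is described later in this chapter.

```tcl
# Utf8Length is a hypothetical helper: it converts a string to UTF-8
# and returns the number of bytes in the result.
proc Utf8Length {string} {
    return [string length [encoding convertto utf-8 $string]]
}

puts [Utf8Length A]       ;# 1 (ASCII)
puts [Utf8Length \u00e9]  ;# 2 (e with an acute accent)
puts [Utf8Length \u20ac]  ;# 3 (the euro sign)
```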

As a Tcl script writer, you can mostly ignore UTF-8 and just think of Tcl as being built on Unicode (i.e., full 16-bit character set support). If you write Tcl extensions in C or C++, however, the impact of UTF-8 and Unicode is quite visible. This is explained in more detail in Chapter 47.

Tcl lets you read and write files in UTF-8 encoding or directly in Unicode. This is useful if you need to use the same file on systems that have different system encodings. These files might be scripts, message catalogs, or documentation. Instead of using a particular native format, you can use Unicode or UTF-8 and read the files the same way on any of your systems. Of course, you will have to set the encoding properly by using fconfigure as shown earlier.

The Binary Encoding

If you want to read a data file and suppress all character set transformations, use the binary encoding:

fconfigure $in -encoding binary

Under the binary encoding, Tcl reads in each 8-bit byte and stores it into the lower half of a 16-bit Unicode character with the high half set to zero. During binary output, Tcl writes out the lower byte of each Unicode character. You can see that reading in binary and then writing it out doesn't change any bits. Watch out if you read something in one encoding and then write it out in binary. Any information in the high byte of the Unicode character gets lost!

Tcl actually handles the binary encoding more efficiently than just described, but logically the previous description is still accurate. As described in Chapter 47, Tcl can manage data in several forms, not just strings. When you read a file in binary format, Tcl stores the data as a ByteArray that is simply 8 bits of data in each byte. However, if you ask for this data as a string (e.g., with the puts command), Tcl automatically converts from 8-bit bytes to 16-bit Unicode characters by setting the high byte to all zeros.

The binary command also manipulates data in ByteArray format. If you read a file with the binary encoding and then use the binary command to process the data, Tcl will keep the data in an efficient form.

The string command also understands the ByteArray format, so you can do operations like string length, string range, and string index on binary data without suffering the conversion cost from a ByteArray to a UTF-8 string.
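As a small illustration of the difference between binary data and text, the sketch below builds two bytes with binary format, inspects them with the string and binary commands, and only then converts them to characters. The byte values are an arbitrary choice for this example: together they are the UTF-8 form of one accented letter.

```tcl
# Two raw bytes, 0xC3 and 0xA9.
set data [binary format H* c3a9]

# As binary data, the value is two bytes long.
puts [string length $data]               ;# 2

# string index works on the bytes directly.
binary scan [string index $data 0] H2 first
puts $first                              ;# c3

# Interpreted as UTF-8, the same two bytes are a single character.
set text [encoding convertfrom utf-8 $data]
puts [string length $text]               ;# 1
```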

Conversions Between Encodings

The encoding command lets you convert strings between encodings. The encoding convertfrom command converts data in some other encoding into a Unicode string. The encoding convertto command converts a Unicode string into some other encoding. For example, the following two sequences of commands are equivalent. They both read data from a file that is in the gb12345 encoding and convert it to Unicode:

fconfigure $input -encoding gb12345
set unicode [read $input]

or

fconfigure $input -encoding binary
set unicode [encoding convertfrom gb12345 [read $input]]

In general, you can lose information when you go from Unicode to any other encoding, so you ought to be aware of the limitations of the encodings you are using. In particular, the binary encoding may not preserve your data if it starts out from an arbitrary Unicode string. Similarly, an encoding like iso8859-2 may simply not have a representation of a given Unicode character.
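The loss is easy to demonstrate with a round trip. In this sketch the sample string and the choice of ascii as the limited encoding are assumptions made for illustration. What happens to an unrepresentable character depends on the Tcl version: it may be replaced by a fallback character, or the conversion may raise an error, so the code allows for both.

```tcl
set original "caf\u00e9"

# iso8859-1 can represent every character in the string, so the
# round trip restores it exactly.
set bytes [encoding convertto iso8859-1 $original]
set latin [encoding convertfrom iso8859-1 $bytes]
puts [string equal $latin $original]   ;# 1

# ascii has no accented e. Either a substitute character is stored,
# or the conversion fails; in both cases the original is not recovered.
if {[catch {encoding convertto ascii $original} asciiBytes]} {
    set plain "<conversion refused>"
} else {
    set plain [encoding convertfrom ascii $asciiBytes]
}
puts [string equal $plain $original]   ;# 0
```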

The encoding Command

Table 15-1 summarizes the encoding command:

Table 15-1. The encoding command

encoding convertfrom ?encoding? data

Converts binary data from the specified encoding, which defaults to the system encoding, into Unicode.

encoding convertto ?encoding? string

Converts string from Unicode into data in the encoding format, which defaults to the system encoding.

encoding names

Returns the names of known encodings.

encoding system ?encoding?

Queries or changes the system encoding.

Message Catalogs

A message catalog is a list of messages that your application will display. The main idea is that you can maintain several catalogs, one for each language you support. Unfortunately, you have to be explicit about using message catalogs. Everywhere you generate output or display strings in Tk widgets, you need to change your code to go through a message catalog. Fortunately, Tcl uses a nice trick to make this fairly easy and to keep your code readable. Instead of using keys like “message42” to get messages out of the catalog, Tcl just uses the strings you would use by default. For example, instead of this code:

puts "Hello, World!"

A version that uses message catalogs looks like this:

puts [msgcat::mc "Hello, World!"]

If you have not already loaded your message catalog, or if your catalog doesn't contain a mapping for “Hello, World!”, then msgcat::mc just returns its argument. Actually, you can define just what happens in the case of unknown inputs by defining your own msgcat::mcunknown procedure, but the default behavior is quite good.

The message catalog is implemented in Tcl in the msgcat package. You need to use package require to make it available to your scripts:

package require msgcat

In addition, all the procedures in the package begin with “mc,” so you can use namespace import to shorten their names further. I am not a big fan of namespace import, but if you use message catalogs, you will be calling the msgcat::mc function a lot, so it may be worthwhile to import it:

namespace import msgcat::mc
puts [mc "Hello, World!"]
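To see msgcat::mc do something, a catalog entry and a locale are needed. The following sketch defines one French entry in the script itself, using the msgcat::mclocale and msgcat::mcset procedures that are described later in this chapter. The French text is only sample data.

```tcl
package require msgcat

# Select French and define one translation for it.
msgcat::mclocale fr
msgcat::mcset fr "Hello, World!" "Bonjour, le monde!"

# A string with a mapping is translated.
puts [msgcat::mc "Hello, World!"]   ;# Bonjour, le monde!

# A string without a mapping is returned unchanged.
puts [msgcat::mc "Goodbye"]         ;# Goodbye
```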

Specifying a Locale

A locale identifies a language or language dialect to use in your output. A three-level scheme is used in the locale identifier:

language_country_dialect

The language codes are defined by the ISO-639 standard. For example, “en” is English and “es” is Spanish. The country codes are defined by the ISO-3166 standard. For example, US is for the United States. (Strictly, the ISO-3166 code for the United Kingdom is GB; the examples in this chapter use UK, which msgcat accepts because it does not check the codes against the standards.) The dialect is up to you. The country and dialect parts are optional. Finally, the locale specifier is case insensitive. The following examples are all valid locale specifiers:

es
en
en_US
en_us
en_UK
en_UK_Scottish
en_uk_scottish

Users can set their initial locale with the LANG and LOCALE environment variables. If there is no locale information in the environment, then the “c” locale is used (the name refers to the C programming language). You can also set and query the locale with the msgcat::mclocale procedure:

msgcat::mclocale
=> c
msgcat::mclocale en_US

The msgcat::mcpreferences procedure returns a list of the user's locale preferences from most specific (i.e., including the dialect) to most general (i.e., only the language). For example:

msgcat::mclocale en_UK_Scottish
msgcat::mcpreferences
=> en_UK_Scottish en_UK en
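The preference list is simply the locale with its trailing parts removed one at a time. The sketch below reproduces that rule in plain Tcl. LocalePreferences is a hypothetical helper written for this illustration; it is not the msgcat implementation, and the real package may normalize case or add entries of its own.

```tcl
# LocalePreferences is a hypothetical helper: it lists a locale and
# each of its more general forms, from most to least specific.
proc LocalePreferences {locale} {
    set prefs {}
    set parts [split $locale _]
    for {set n [llength $parts]} {$n > 0} {incr n -1} {
        # Keep the first n parts and rejoin them with underscores.
        lappend prefs [join [lrange $parts 0 [expr {$n - 1}]] _]
    }
    return $prefs
}

puts [LocalePreferences en_UK_Scottish]  ;# en_UK_Scottish en_UK en
puts [LocalePreferences es]              ;# es
```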

Managing Message Catalog Files

A message catalog is simply a Tcl source file that contains a series of msgcat::mcset commands that define entries in the catalog. The syntax of the msgcat::mcset procedure is:

msgcat::mcset locale src-string ?dest-string?

The locale is a locale description like es or en_US_Scottish. The src-string is the string used as the key when calling msgcat::mc. The dest-string is the result of msgcat::mc when the locale is in force.

The msgcat::mcload procedure should be used to load your message catalog files. It expects the files to be named according to their locale (e.g., en_US_Scottish.msg), and it binds the message catalog to the current namespace.

The msgcat::mcload procedure loads files that match the msgcat::mcpreferences and have the .msg suffix. For example, with a locale of en_UK_Scottish, msgcat::mcload would look for these files:

en_UK_Scottish.msg en_UK.msg en.msg

The standard place for message catalog files is in the msgs directory below the directory containing a package. With this arrangement you can call msgcat::mcload as shown below. The use of info script to find related files is explained on page 192.

msgcat::mcload [file join [file dirname [info script]] msgs]

The message catalog file is sourced, so it can contain any Tcl commands. You might find it convenient to import the msgcat::mcset procedure. Be sure to use -force with namespace import because that command might already have been imported as a result of loading other message catalog files. Example 15-3 shows three trivial message catalog files:

Example 15-3. Three sample message catalog files

## en.msg
namespace import -force msgcat::mcset

mcset en Hello Hello_en
mcset en Goodbye Goodbye_en
mcset en String String_en
# end of en.msg

## en_US.msg
namespace import -force msgcat::mcset

mcset en_US Hello Hello_en_US
mcset en_US Goodbye Goodbye_en_US
# end of en_US.msg

## en_US_Texan.msg
namespace import -force msgcat::mcset

mcset en_US_Texan Hello Howdy!
# end of en_US_Texan.msg

Assuming the files from Example 15-3 are all in the msgs directory below your script, you can load all these files with these commands:

msgcat::mclocale en_US_Texan
msgcat::mcload [file join [file dirname [info script]] msgs]

The dialect has the highest priority:

msgcat::mc Hello
=> Howdy!

If the dialect does not specify a mapping, then the country mapping is checked:

msgcat::mc Goodbye
=> Goodbye_en_US

Finally, the lowest priority is the language mapping:

msgcat::mc String
=> String_en
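The same fallback can be tried without any catalog files by making the msgcat::mcset calls directly. This is a self-contained sketch that repeats the entries of Example 15-3 in a single script.

```tcl
package require msgcat

# Language-level entries.
msgcat::mcset en Hello Hello_en
msgcat::mcset en Goodbye Goodbye_en
msgcat::mcset en String String_en

# Country-level entries override the language level.
msgcat::mcset en_US Hello Hello_en_US
msgcat::mcset en_US Goodbye Goodbye_en_US

# The dialect-level entry overrides both.
msgcat::mcset en_US_Texan Hello Howdy!

msgcat::mclocale en_US_Texan
puts [msgcat::mc Hello]    ;# Howdy!
puts [msgcat::mc Goodbye]  ;# Goodbye_en_US
puts [msgcat::mc String]   ;# String_en
```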

Message Catalogs and Namespaces

What happens if two different library packages have conflicting message catalogs? Suppose the foo package contains this call:

msgcat::mcset fr Hello Bonjour

But the bar package contains this conflicting definition:

msgcat::mcset fr Hello Ello

What happens is that msgcat::mcset and msgcat::mc are sensitive to the current Tcl namespace. Namespaces are described in detail in Chapter 14. If the foo package loads its message catalog while inside the foo namespace, then any calls to msgcat::mc from inside the foo namespace will see those definitions. In fact, if you call msgcat::mc from inside any namespace, it will find only message catalog definitions defined from within that namespace.
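The following sketch shows the effect with two namespaces named foo and bar, matching the packages in the discussion above. The greet procedures are hypothetical and exist only to call msgcat::mc from inside each namespace.

```tcl
package require msgcat
msgcat::mclocale fr

namespace eval foo {
    # Defined while the current namespace is ::foo.
    msgcat::mcset fr Hello Bonjour
    proc greet {} {
        return [msgcat::mc Hello]
    }
}

namespace eval bar {
    # The same source string, defined separately for ::bar.
    msgcat::mcset fr Hello Ello
    proc greet {} {
        return [msgcat::mc Hello]
    }
}

puts [foo::greet]  ;# Bonjour
puts [bar::greet]  ;# Ello
```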

If you want to share message catalogs between namespaces, you will need to implement your own version of msgcat::mcunknown that looks in the shared location. Example 15-4 shows a version that looks in the global namespace before returning the default string.

Example 15-4. Using msgcat::mcunknown to share message catalogs

proc msgcat::mcunknown {locale src} {
   variable insideUnknown
   if {![info exists insideUnknown]} {

      # Try the global namespace, being careful to note
      # that we are already inside this procedure.

      set insideUnknown true
      set result [namespace eval :: [list \
         msgcat::mc $src \
      ]]
      unset insideUnknown
      return $result
   } else {

      # Being called because the message isn't found
      # in the global namespace

      return $src
   }
}

The msgcat package

Table 15-2 summarizes the msgcat package.

Table 15-2. The msgcat package

msgcat::mc src

Returns the translation of src according to the current locale and namespace.

msgcat::mclocale ?locale?

Queries or sets the current locale.

msgcat::mcmax ?src-string src-string ...?

Returns the length of the longest src-string after translation. (Tcl 8.3)

msgcat::mcpreferences

Returns a list of locale preferences ordered from the most specific to the most general.

msgcat::mcload directory

Loads message files for the current locale from directory.

msgcat::mcset locale src translation

Defines a mapping for the src string in locale to the translation string. (Tcl 8.3)

msgcat::mcmset locale src-trans-list

Defines multiple src-translation pairs for locale in a single call.

msgcat::mcunknown locale src

This procedure is called to resolve unknown translations. Applications can provide their own implementations.



[*] The exmh home page is http://www.beedub.com/exmh/. It is a wonderful tool that helps me manage tons of email. It is written in Tcl/Tk, of course, and relies on the MH mail system, which limits it to UNIX.
