Chapter 5. Data Storage

It is not an overstatement to say that every software application processes some kind of data. Without data, an application's functionality would be rather modest, after all. Data can come from different sources—such as user input, network communication, and so on. It can also be stored locally on behalf of the application.

Because data handling is an inevitable part of your code, this chapter focuses on it, and particularly on textual data. Why? Binary data is rather simple and therefore not very interesting: it is just a sequence of bytes that can represent virtually anything. Text data is different: there are common practices and libraries for processing it, and that is what we are going to discuss. There is also a trend to prefer text data over binary in many cases; the success of text-based XML, for example, proves this.

The first topic will focus on the internationalization of your application. We will discuss the issues with proper encoding of your text data, and how to avoid common pitfalls. We will then briefly present how to create a multi-language program using the powerful package offered by Tcl—msgcat.

The next item will cover accessing relational databases using SQL. We will present how to access MySQL or PostgreSQL databases, and then we will focus on the Holy Grail of the Tcl database world, the SQLite database engine. Originally created by Tcl users for Tcl users, it was later widely recognized beyond that circle, and became a popular solution used in many programming languages and by many corporations, such as Adobe, Apple, Google, and Sun.

We will then touch on a topic closely related to networking and data exchange: XML. The XML format is very relevant when transmitting text data over the network. With the focus on the tdom package, we will present how to parse XML data and access particular information stored inside it. You will also learn how to create an XML document.

Finally, we will discuss the possibility of storing raw, unformatted Tcl data structures such as lists or dictionaries inside flat text files. The way Tcl treats such structures makes this task simple, which makes it a handy, if primitive, solution.

Internationalization

Almost every software application communicates with the user or a support person by means of human-readable text messages. When you start developing an application, you may be tempted to use the easiest form, for example calling puts (or some other command) with the message in your native language. It may work pretty well in the case of small, one-evening applications, but when it comes to mature software that is going to be presented to the world, you will soon search for an easy way to make your application speak different languages.

Encoding issues

Internally, Tcl uses Unicode to store any string data in memory. We assume the reader is familiar with the terms Unicode and UTF-8; describing them is beyond the scope of this book. For more information, you can visit http://www.unicode.org or simply search the network for articles, because the topic is widely covered.

As UTF-8 encoding solves most of the problems with internationalization (the term is often abbreviated to i18n, as there are 18 characters in this word between 'i' and 'n'), Tcl may be considered a mature solution in this matter. You do not even have to think about encoding issues as long as you operate on Tcl string data.

By default, Tcl reads all files using the system encoding (for example, for those using English in Windows, it is Windows-1252) and converts them to Unicode. The same applies to writing: by default, data is converted to the system encoding.

It may (and based on general experience, it will) happen that you want to read a file that is encoded differently from the system encoding. In that case you have to alter the default conversion by reconfiguring the channel, which you can do with the Tcl command:

fconfigure $channelId -encoding encodingName

channelId is nothing more than the identifier of the channel returned by the open command. When you select the encoding by specifying its name (encodingName), Tcl treats incoming data as if it were encoded in the specified format and converts it to Unicode, and it converts outgoing data from Unicode to the target encoding. The encoding names command returns a list of all available encoding names. If you omit encodingName, the fconfigure command returns the current encoding for the specified channel.

Let's illustrate this with an example. Assume that we have two text files, utf8.txt and cp1250.txt, encoded in UTF-8 and Cp1250, respectively. The following code sample shows how Tcl will behave when reading them:

puts "system encoding is Cp1250, so the file will be read correctly:"
puts [read [open cp1250.txt r]]
puts "\nand this one will be malformed:"
puts [read [open utf8.txt r]]
puts "\nall known encodings are:"
puts [encoding names]
puts "\nafter proper configuration of encoding for the channel:"
set channelId [open utf8.txt r]
fconfigure $channelId -encoding utf-8
puts [read $channelId]
close $channelId

The output is:

system encoding is Cp1250, so the file will be read correctly:
This file is encoded with Cp1250 and contains some Polish characters:
zażółć gęślą jaźń
and this one will be malformed:
This file is encoded with UTF-8 and contains some Polish characters:
zażółć gęślÄ... jaĹşĹ"
all known encodings are:
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 gb2312-raw cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine jis0201 gb2312 euc-cn euc-jp macThai iso8859-10 jis0208 iso2022-jp macIceland iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania macTurkish gb1988 iso2022-kr macGreek ascii cp437 macRoman iso8859-1 iso8859-2 iso8859-3 macCroatian koi8-r iso8859-4 ebcdic iso8859-5 cp1250 macCyrillic iso8859-6 cp1251 macDingbats koi8-u iso8859-7 cp1252 iso8859-8 cp1253 iso8859-9 cp1254 cp1255 cp850 cp1256 cp932 identity cp1257 cp852 macJapan cp1258 shiftjis utf-8 cp855 cp936 symbol cp775 unicode cp857
after proper configuration of encoding for the channel:
This file is encoded with UTF-8 and contains some Polish characters:
zażółć gęślą jaźń

The system encoding in this case is Cp1250, so the cp1250.txt file does not require any special action to be read properly. Ironically, when reading from a file in UTF-8, which is the core encoding for Tcl, the channel must be properly configured, otherwise the data is corrupted by incorrect conversion, as it is treated as encoded in Cp1250, not UTF-8. You can also see that the list of supported encodings is quite impressive.
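The pattern used above (open, configure the encoding, read, close) can be wrapped in a small helper. The following is only a minimal sketch, assuming Tcl 8.6 (for try and file tempfile); the procedure name readFileAs and the temporary file are hypothetical and were introduced just to keep the example self-contained.

```tcl
# Read a whole file, decoding it with an explicitly chosen encoding.
# readFileAs is a hypothetical helper name used only in this sketch.
proc readFileAs {path encodingName} {
    set channelId [open $path r]
    try {
        fconfigure $channelId -encoding $encodingName
        return [read $channelId]
    } finally {
        # Close the channel even if configuring or reading fails
        close $channelId
    }
}

# Create a small UTF-8 file so that the sketch can run on its own
set channelId [file tempfile samplePath]
fconfigure $channelId -encoding utf-8
puts -nonewline $channelId "za\u017C\u00F3\u0142\u0107"
close $channelId

set text [readFileAs $samplePath utf-8]
puts [string length $text]   ;# 6 characters, stored as 10 bytes in the file
file delete $samplePath
```

Passing the encoding as an argument keeps the decision about the file format in one visible place instead of relying on whatever the system encoding happens to be.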

Apart from altering the configuration of the channel, Tcl also offers the possibility to convert a string to a different encoding with the following commands:

encoding convertfrom encodingName $string
encoding convertto encodingName $string

The command syntax may strike you as confusing, but it is self-explanatory: it always converts from encodingName to Unicode or from Unicode to encodingName. If the name of the encoding is omitted, the current default system encoding will be used. To clear up any confusion, let's illustrate it with a one-line example:

puts [encoding convertfrom cp1250 "za\xBF\xF3\xB3\xE6 g\xEA\x9Cl\xB9 ja\x9F\xF1"]

In this code, each \x followed by two hexadecimal digits denotes the byte with that code in the Cp1250 standard, so the string holds Cp1250-encoded data. After conversion it becomes a regular Unicode string, which allows it to be printed correctly:

zażółć gęślą jaźń
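To make the direction of both commands more tangible, here is a short sketch that converts a single character to bytes and back. The choice of the character and the variable names are ours, made for illustration only.

```tcl
# U+0105 is the Polish letter "a with ogonek", the last letter of "gęślą"
set char "\u0105"

# Unicode string -> bytes in the target encoding
set cp1250Bytes [encoding convertto cp1250 $char]
set utf8Bytes   [encoding convertto utf-8 $char]

# Show the resulting bytes as hexadecimal digits
binary scan $cp1250Bytes H* cp1250Hex
binary scan $utf8Bytes H* utf8Hex
puts "cp1250: $cp1250Hex"   ;# b9
puts "utf-8:  $utf8Hex"     ;# c485

# Bytes -> Unicode string; both conversions give back the same character
set fromCp1250 [encoding convertfrom cp1250 $cp1250Bytes]
set fromUtf8   [encoding convertfrom utf-8 $utf8Bytes]
puts [expr {$fromCp1250 eq $char && $fromUtf8 eq $char}]   ;# 1
```

The same character occupies one byte in Cp1250 and two bytes in UTF-8, which is why bytes read with the wrong encoding turn into garbage.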

In a similar way we can read the file utf8.txt mentioned earlier:

set channelId [open utf8.txt r]
fconfigure $channelId -encoding binary
set data [read $channelId]
close $channelId
puts [encoding convertfrom utf-8 $data]

Once the data is read in binary mode, it is treated as an array of bytes. Although the file content itself is encoded in UTF-8, puts $data would produce garbage, as each byte would be treated as a separate character code, whereas in UTF-8 a character may be encoded in more than one byte. Using the encoding convertfrom utf-8 command allows us to correctly decode the binary data into human-readable content:

This file is encoded with UTF-8 and contains some Polish characters:
zażółć gęślą jaźń

In most cases Tcl does not have any problems with correct detection of the system encoding. In the rare cases when it fails, it will fall back to ISO 8859-1. At any time you can retrieve the detected encoding with the command encoding system. It is also possible to force a system encoding other than the detected one with the encoding system encodingName command, but it is generally best not to alter this.
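A quick way to check what Tcl detected on a given machine is sketched below. The variable name is ours, and the printed value depends on the platform, so no particular output is assumed.

```tcl
# Ask Tcl which system encoding it detected
set detected [encoding system]
puts "system encoding: $detected"

# The detected name should be one of the encodings known to Tcl
puts [expr {$detected in [encoding names]}]
```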

As mentioned earlier, by default Tcl reads files using the system encoding. This is also true when it comes to loading additional scripts with the source command, so it is generally advised that all script files be encoded with the default system encoding.

If you need a particular Unicode character, you can always write it in the form \uxxxx, where xxxx is the four-digit hexadecimal Unicode code value of the character.
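As a brief illustration, the first word of the sample text used in this chapter can be written with such escapes, which keeps the script itself pure ASCII and therefore independent of the encoding it is saved in. The variable name is chosen for this sketch only.

```tcl
# \u017C = ż, \u00F3 = ó, \u0142 = ł, \u0107 = ć
set word "za\u017C\u00F3\u0142\u0107"
puts $word
puts [string length $word]   ;# 6
```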

If your Tcl script is saved in an encoding other than the system one, you may use the following workaround (let's assume the script.tcl file is encoded in UTF-8):

set channel [open script.tcl r]
fconfigure $channel -encoding utf-8
set script [read $channel]
close $channel
eval $script

This reads the content of the file into a variable using the correct encoding conversion, and then passes this variable to the eval command.

Starting from Tcl 8.5, the source command may be instructed about the encoding of the file that is to be sourced. For example, to force our script.tcl file to be read as UTF-8 encoded, all we have to do is:

source -encoding utf-8 script.tcl
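The following self-contained sketch shows the option in action. It assumes Tcl 8.6 (for file tempfile), and the temporary script and the greeting variable are made up for this example.

```tcl
# Write a tiny script as UTF-8; the file and its content are hypothetical
set channelId [file tempfile scriptPath]
fconfigure $channelId -encoding utf-8
puts $channelId "set greeting \"Witaj \u015Bwiecie\""
close $channelId

# Evaluate it, telling source which encoding the file uses
source -encoding utf-8 $scriptPath
puts [string length $greeting]   ;# 13
file delete $scriptPath
```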

Translating your application into different languages

Tcl offers a solution to this problem in the shape of the msgcat package (short for message catalog). The functionality it offers is so vital that the package has shipped with every Tcl interpreter since version 8.1. The manual may be found in the full Tcl documentation; at the time of writing, it is located at http://www.tcl.tk/man/tcl8.5/TclCmd/msgcat.htm.

The basic concept is simple—you have to translate every string used in your application to the languages you are going to support. All translations should be stored in a directory (the name is your choice; in this chapter we assume it is messages). If a translation is missing, the untranslated source string will be used. For every language there should be one corresponding file with a .msg extension. The name of each file is directly related to the locale identifier for the language it contains. The following formats of locale identifier are supported:

  • language_country_modifier—for example en_GB_Funky
  • language_country—for example en_GB
  • language—for example en

The language and country codes are defined in the ISO-639 and ISO-3166 standards, and the modifier may be a string of your choice. If there is more than one .msg file for a given language, the best matching file is used. For example, if the system locale is "en_US", and we have both en.msg and en_US.msg files, the second one would be used. If only en.msg is present, then this file would be chosen as the best (and only) match.
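The matching rule can be tried out without any files by registering translations directly with ::msgcat::mcset (the command is described next). This is only a sketch: the two calls stand in for the content of en.msg and en_US.msg, and the message texts are made up.

```tcl
package require msgcat

# In-memory stand-ins for en.msg and en_US.msg (the texts are made up)
::msgcat::mcset en    "Hello" "Hello (generic English)"
::msgcat::mcset en_US "Hello" "Howdy"

# en_US has its own entry, so the most specific translation wins
::msgcat::mclocale en_US
set usGreeting [::msgcat::mc "Hello"]

# There is no en_GB entry, so the lookup falls back to plain en
::msgcat::mclocale en_GB
set gbGreeting [::msgcat::mc "Hello"]

puts $usGreeting   ;# Howdy
puts $gbGreeting   ;# Hello (generic English)
```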

Each of the translation files contains a set of calls to the command defining translations for every string used in your application. The command is:

::msgcat::mcset locale string translation


  • locale parameter is in the same format as described earlier
  • string is a source, untranslated string
  • translation is a translated string; if the translation is omitted, the source string will be used
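A short sketch of both forms of the command follows; the strings are chosen for illustration only and do not come from the example application.

```tcl
package require msgcat
::msgcat::mclocale pl

::msgcat::mcset pl "Cancel" "Anuluj"   ;# explicit translation
::msgcat::mcset pl "OK"                ;# translation omitted

set cancelText [::msgcat::mc "Cancel"]
set okText     [::msgcat::mc "OK"]
puts $cancelText   ;# Anuluj
puts $okText       ;# OK
```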

Each translation file is evaluated as a normal Tcl script. The msgcat package reads .msg files as UTF-8, so each file should be saved in that encoding. For example, let's assume that we want to define translations for the Polish and Spanish languages. The messages directory will contain two files:

  • pl.msg:
    ::msgcat::mcset pl "Hello World" "Witaj świecie"
  • es.msg:
    ::msgcat::mcset es "Hello World" "Hola Mundo"

Each of these contains a translation of the "Hello World" string into the appropriate language.

To start using translations, all you have to do is load them with the command ::msgcat::mcload directory_name (the parameter specifies the directory where the translations are located). Next, you use the ::msgcat::mc string command, which replaces string with the appropriate translation.

The following example illustrates the usage of msgcat package:

package require msgcat
puts "system locale : [msgcat::mclocale]"
puts "system preferences: [msgcat::mcpreferences]\n"
foreach locale {pl_PL es en} {
    msgcat::mclocale $locale
    msgcat::mcload [file join [file dirname [info script]] messages]
    puts "current locale is: [msgcat::mclocale]"
    puts "current preferences are: [msgcat::mcpreferences]"
    puts "Translated message is: [::msgcat::mc "Hello World"]"
}

The code sets each locale in turn (Polish, Spanish, and English) and prints the translation of the string "Hello World" for each of them. As there is no en.msg file, the English run falls back to the source string. The output of this example is:

system locale : pl
system preferences: pl {}

current locale is: pl_pl
current preferences are: pl_pl pl {}
Translated message is: Witaj świecie
current locale is: es
current preferences are: es {}
Translated message is: Hola Mundo
current locale is: en
current preferences are: en {}
Translated message is: Hello World

The command msgcat::mclocale returns the current locale identifier (by default, system locale), but it can be also used to set a new locale, as depicted in the example. Note that after each change of the current locale, the command ::msgcat::mcload must be called again, to find and load the matching translation file.

Another command, msgcat::mcpreferences, returns an ordered list (starting from the most specific) of locale identifiers that will be used to match the .msg file. As you can see, the preferences list for the locale pl_PL is: pl_pl pl {}, so the most preferred translation file would be pl_pl.msg, then pl.msg and finally no translation at all.

It is worth noting that the msgcat package is aware of namespaces. What this means is that translations from different namespaces are handled separately, which prevents possible side effects between different packages, which may arise if both packages were to try translating the same string to a different message. What is more, if the translation is not found in the current namespace, msgcat will search for it in the parent namespaces until the global namespace is reached. Here is an example of such a behavior—the content of translation file:

::msgcat::mcset pl "test message" "wiadomość testowa"
::msgcat::mcset pl "test message2" "wiadomość testowa2"
namespace eval ::test {
    ::msgcat::mcset pl "test message" "to jest wiadomość testowa"
}

Example code:

package require msgcat
msgcat::mclocale pl
msgcat::mcload [file join [file dirname [info script]] messages]
puts [::msgcat::mc "test message"]
puts [::msgcat::mc "test message2"]
namespace eval ::test {
    puts [::msgcat::mc "test message"]
    puts [::msgcat::mc "test message2"]
}

The output:

wiadomość testowa
wiadomość testowa2
to jest wiadomość testowa
wiadomość testowa2

As you can see, the "test message" translation varies with the namespace, and the translation of "test message2" is obtained from the global namespace, as the test namespace does not contain a definition for it.

It is worth noting that the ::msgcat::mc command can accept additional parameters apart from the string to translate, and in such a case, the Tcl format command is used for parameter substitution. It also enables a different order of supplied values in the output string, as shown in the following example:

The translation:

::msgcat::mcset pl "January" "Styczeń"
::msgcat::mcset pl "date: the %d of %s" "Data: %2\$s, dnia %1\$d"

The code:

package require msgcat
foreach locale {pl en} {
    msgcat::mclocale $locale
    msgcat::mcload [file join [file dirname [info script]] messages]
    puts [::msgcat::mc "date: the %d of %s" 15 [::msgcat::mc January]]
}

The output clearly shows the different positions of the supplied arguments, depending on the locale:

Data: Styczeń, dnia 15
date: the 15 of January
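Since the substitution is delegated to the format command, the positional specifiers can also be tried with format alone. In this sketch the variable names are ours; the format strings are written in braces so that Tcl does not try to substitute $s and $d as variables (inside double quotes the dollar signs would have to be escaped as \$).

```tcl
# Braces keep Tcl from treating $s and $d as variable references
set english {date: the %d of %s}
set polish  {Data: %2$s, dnia %1$d}

# The same two values, supplied in the same order, for both strings
set englishDate [format $english 15 January]
set polishDate  [format $polish 15 "Stycze\u0144"]

puts $englishDate   ;# date: the 15 of January
puts $polishDate    ;# Data: Styczeń, dnia 15
```

In %2$s the number before the dollar sign selects which argument to use, so the translator can reorder the values without any change to the calling code.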