Multi-Byte Character Sets

Most programmers are accustomed to working with single-byte character sets. In the U.S., we like to pretend that ASCII is the only meaningful mapping between characters and numbers. This is not the case. Standards organizations such as ANSI (the American National Standards Institute) and the ISO (the International Organization for Standardization) have defined many different encodings that associate a unique number with each character in a given character set. Theoretically, a single-byte character set can encode 256 different characters. In practice, however, most single-byte character sets are limited to about 96 visible characters. The range of values is cut in half by the fact that the most significant bit is sometimes considered off-limits when representing characters: it is often used as a parity bit and occasionally as an end-of-string marker. Of the remaining 128 values, many are used to represent control characters (such as tab, new-line, carriage return, and so on). By the time you add punctuation and numeric characters, the remaining 96 values start feeling a bit cramped.

Single-byte character sets work well for languages with a relatively small number of characters. Eventually, most of us must make the jump to multi-byte encodings. Adding a second byte dramatically increases the number of characters that you can represent. A single-byte character set can encode 256 values; a double-byte set can encode 65,536 characters. Multi-byte character sets are required for some languages, particularly languages used in East Asia. Again, standards organizations have defined many multi-byte encoding standards.

The Unicode Consortium was formed with the goal of providing a single encoding for all character sets. The consortium published its first proposed standard in 1991 (“The Unicode Standard, Version 1.0”). A two-byte number can represent most of the Unicode encoding values. Some characters require more than two bytes and, as you'll see shortly, many common characters can be stored in a single byte.

I've always found that the various forms of the Unicode encoding standard were difficult to understand. Let me try to explain the problem (and Unicode's solution) with an analogy.

Suppose you grabbed a random byte from somewhere on the hard drive in your computer. Let's say that the byte you select has a value of 48. What does that byte mean? It might mean the number of states in the contiguous United States. It might mean the character '0' in the ASCII character set. It could represent 17 more than the number of flavors you can get at Baskin-Robbins. Let's assume that this byte represents the current temperature. Is that 48° in the Centigrade, Fahrenheit, Kelvin, Réaumur, or Rankine scale? The distinction is important: 48° is a little chilly in Fahrenheit, but mighty toasty in Centigrade.

There are two levels of encoding involved here. The lowest level of encoding tells us that 48 represents a temperature value. The higher level tells us that the temperature is expressed in degrees Fahrenheit. We have to know both encodings before we can understand the meaning of the byte. If we don't know the encoding(s), 48 is just data. After we understand the encodings, 48 becomes information.

Unicode is an encoding system that assigns a unique number to each character. Which characters are included in the Unicode Standard? Version 3.0 of the Unicode Standard provides definitions for 49,194 characters. Version 3.1 added 44,946 character mappings, and Version 3.2 added an additional 1,016 for a total of 95,156 characters. Version 4.0 takes the total to 96,832 characters. I'd say that the chances are very high that any character you need is defined in the Unicode Standard.

Just like the temperature encodings I mentioned earlier, there are two levels of encoding in the Unicode Standard.

At the most fundamental level, Unicode assigns a unique number, called a code point, to each character. For example, the Latin capital 'A' is assigned the code point 65 (hexadecimal 41). The Cyrillic (Russian) capital de ('Д') is assigned the code point 1044 (hexadecimal 414). The Unicode Standard suggests that we write these values using the form 'U+xxxx', where 'xxxx' is the code point expressed in hexadecimal notation. So, we should write U+0041 and U+0414 to indicate the Unicode mappings for 'A' and 'Д'. The mapping from characters to numbers is called the Universal Character Set, or UCS.

At the next level, each code point is represented in one of several UCS transformation formats (UTF). The most commonly seen UTF is UTF-8[1]. The UTF-8 scheme is a variable-width encoding form, meaning that some code points (that is, some characters) are represented by a single byte, and others are represented by two, three, or four bytes. UTF-8 divides the Unicode code point space into four ranges, with each range requiring a different number of bytes, as shown in Table 22.3.

[1] Other UTF encodings are UTF-16BE (variable-width, 16 bit, big-endian), UTF-16LE (variable-width, 16 bit, little-endian), UTF-32BE, and UTF-32LE.

Table 22.3. UTF-8 Code Point Widths

 Low Value | High Value | Storage Size | Sample Character | UTF-8 Encoding
-----------+------------+--------------+------------------+---------------------
 U+0000    | U+007F     | 1 byte       | A (U+0041)       | 0x41
           |            |              | 0 (U+0030)       | 0x30
 U+0080    | U+07FF     | 2 bytes      | © (U+00A9)       | 0xC2 0xA9
           |            |              | æ (U+00E6)       | 0xC3 0xA6
           |            |              | ج (U+062C)       | 0xD8 0xAC
 U+0800    | U+FFFF     | 3 bytes      | € (U+20AC)       | 0xE2 0x82 0xAC
 U+10000   | U+10FFFF   | 4 bytes      | (U+1D160)        | 0xF0 0x9D 0x85 0xA0
           |            |              | ∑ (U+1D6F4)      | 0xF0 0x9D 0x9B 0xB4

The Unicode mappings for the first 128 code points are identical to the mappings for the ASCII character set. The ASCII code point for 'A' is 0x41, and the same code point is used to represent 'A' in Unicode. The UTF-8 encodings for the values 0 through 127 are simply the values 0 through 127. The net effect of these two rules is that all ASCII characters require a single byte in the UTF-8 encoding scheme and the ASCII characters map directly into the same Unicode code points. In other words, an ASCII string is identical to the UTF-8 string containing the same characters.

UTF-8 isn't the only transformation format. The disadvantage to UTF-8 is that it is a variable-width encoding form, and variable-width forms can be difficult to handle in some applications. UTF-16 is another common transformation format: Each character requires two bytes in UTF-16. You may be thinking that you can't encode the 96,832 characters defined by Unicode 4.0 in a two-byte value. You're right; you can't. UTF-16 was a fixed-width encoding until Unicode version 3.0. Starting with version 3.1, the Unicode Standard defined more than 65,536 character mappings (65,536 is the number of distinct values you can store in two bytes). To get around this limitation, the Unicode Consortium invented the surrogate pair. A surrogate pair is simply a way of encoding a single character in two two-byte values. That means that UTF-16 is now a variable-width encoding. The only fixed-width encoding currently defined by the Unicode Consortium is UTF-32 (also known as UCS-4).
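
You can see the variable-width nature of UTF-8 from within PostgreSQL itself. Here is a minimal sketch, assuming a database created with the UNICODE encoding and a client encoding of UNICODE: the char_length() function counts characters, while octet_length() counts the bytes used to store them.

movies=# SELECT char_length( 'Ab€' ), octet_length( 'Ab€' );
 char_length | octet_length
-------------+--------------
           3 |            5
(1 row)

The euro sign accounts for three of the five bytes, just as Table 22.3 predicts.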

PostgreSQL understands how to store and manipulate characters (and strings) expressed in Unicode/UTF-8. PostgreSQL can also work with multibyte encodings other than Unicode/UTF-8. In fact, PostgreSQL understands single-byte encodings other than ASCII.

Encodings Supported by PostgreSQL

PostgreSQL does not store a list of valid encodings in a table, but you can create such a table. Listing 22.1 shows a PL/pgSQL function that creates a temporary table (encodings) that holds the names of all encoding schemes supported by our server.

Listing 22.1. get_encodings.sql
--
-- Filename: get_encodings.sql
--

CREATE OR REPLACE FUNCTION get_encodings() RETURNS INTEGER AS
'
  DECLARE
    enc     INTEGER := 0;
    name    VARCHAR;
  BEGIN
    CREATE TEMP TABLE encodings ( enc_code int, enc_name text );
    LOOP
        SELECT INTO name pg_encoding_to_char( enc );

        IF( name = '''' ) THEN
            EXIT;
        ELSE
            INSERT INTO encodings VALUES( enc, name );
        END IF;

        enc := enc + 1;
    END LOOP;

    RETURN enc;
  END;
' LANGUAGE 'plpgsql';

get_encodings() assumes that encoding numbers start at zero and that there are no gaps. This may not be a valid assumption in future versions of PostgreSQL. We use the pg_encoding_to_char() built-in function to translate an encoding number into an encoding name. If the encoding number is invalid, pg_encoding_to_char() returns an empty string.
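
If you want to experiment with pg_encoding_to_char() directly, you can call it from psql. This is just a sketch of the idea behind the loop in get_encodings(): encoding number 6 is defined (UNICODE), and an out-of-range number produces an empty string.

movies=# SELECT pg_encoding_to_char( 6 );
 pg_encoding_to_char
---------------------
 UNICODE
(1 row)

movies=# SELECT pg_encoding_to_char( 9999 );
 pg_encoding_to_char
---------------------

(1 row)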

When you call get_encodings(), it will return the number of rows written to the encodings table.

movies=# select get_encodings(); 
 get_encodings 
---------------
            34
(1 row)

movies=# select * from encodings;
 enc_code |   enc_name
----------+---------------
        0 | SQL_ASCII
        1 | EUC_JP
        2 | EUC_CN
        3 | EUC_KR
        4 | EUC_TW
        5 | JOHAB
        6 | UNICODE
        7 | MULE_INTERNAL
        8 | LATIN1
        9 | LATIN2
       10 | LATIN3
       11 | LATIN4
       12 | LATIN5
       13 | LATIN6
       14 | LATIN7
       15 | LATIN8
       16 | LATIN9
       17 | LATIN10
       18 | WIN1256
       19 | TCVN
       20 | WIN874
       21 | KOI8
       22 | WIN
       23 | ALT
       24 | ISO_8859_5
       25 | ISO_8859_6
       26 | ISO_8859_7
       27 | ISO_8859_8
       28 | WIN1250
       29 | SJIS
       30 | BIG5
       31 | GBK
       32 | UHC
       33 | GB18030
(34 rows)

Some of these encoding schemes are single-byte encodings: SQL_ASCII, the LATIN* family, KOI8, WIN, ALT, and the ISO_8859_* sets. Table 22.4 lists the encodings supported by PostgreSQL version 8.0.0.

Table 22.4. Supported Encoding Schemes

 Encoding      | Defined By                | Single or Multibyte | Languages Supported
---------------+---------------------------+---------------------+--------------------------------------------------
 SQL_ASCII     | ASCII                     | S                   |
 EUC_JP        | JIS X 0201-1997           | M                   | Japanese
 EUC_CN        | RFC 1922                  | M                   | Chinese
 EUC_KR        | RFC 1557                  | M                   | Korean
 EUC_TW        | CNS 11643-1992            | M                   | Traditional Chinese
 JOHAB         | KS C 5601-1992 annex 3    | M                   | Extended Korean
 UNICODE       | Unicode Consortium        | M                   | All scripts
 MULE_INTERNAL | Mule internal code        | M                   | Multilingual Emacs
 LATIN1        | ISO-8859-1                | S                   | Western Europe
 LATIN2        | ISO-8859-2                | S                   | Eastern Europe
 LATIN3        | ISO-8859-3                | S                   | Southern Europe
 LATIN4        | ISO-8859-4                | S                   | Northern Europe
 LATIN5        | ISO-8859-9                | S                   | Turkish
 LATIN6        | ISO-8859-10               | S                   | Nordic
 LATIN7        | ISO-8859-13               | S                   | Baltic Rim
 LATIN8        | ISO-8859-14               | S                   | Celtic
 LATIN9        | ISO-8859-15               | S                   | Similar to LATIN1; replaces some characters with French and Finnish characters, adds the euro sign
 LATIN10       | ISO-8859-16               | S                   | Romanian
 WIN1256       | Windows 1256              | S                   | Arabic
 TCVN          | TCVN 5712:1993            | S                   | Vietnamese
 WIN874        | Windows 874               | S                   | Thai
 KOI8          | RFC 1489                  | S                   | Cyrillic
 WIN           | Windows 1251              | S                   | Cyrillic
 ALT           | IBM866                    | S                   | Cyrillic
 ISO_8859_5    | ISO-8859-5                | S                   | Cyrillic
 ISO_8859_6    | ISO-8859-6                | S                   | Arabic
 ISO_8859_7    | ISO-8859-7                | S                   | Greek
 ISO_8859_8    | ISO-8859-8                | S                   | Hebrew
 SJIS          | JIS X 0202-1991           | M                   | Japanese
 BIG5          | RFC 1922                  | M                   | Chinese for Taiwan
 GBK           | GB 13000.1-93             | M                   | Extended Chinese
 UHC           | Windows 949 (and others)  | M                   | Unified Hangul
 GB18030       | GB 18030-2000             | M                   | Chinese ideograms
 WIN1250       | Windows 1250              | S                   | Eastern Europe

I've spent a lot of time talking about Unicode, but as you can see from Table 22.4, you can use other encodings with PostgreSQL. Unicode has one important advantage over the other encoding schemes: A character in any other encoding system can be translated into Unicode and then translated back into the original encoding system.

You can use Unicode as a pivot to translate between other encodings. For example, if you want to translate common characters from ISO-646-DE (German) into ISO-646-DK (Danish), you can first convert all characters into Unicode (all ISO-646-DE characters will map into Unicode) and then map from Unicode back to ISO-646-DK. Some German characters will not translate into Danish. For example, the DE+0040 character ('§') will map to Unicode U+00A7. There is no '§' character in the ISO-646-DK character set, so this character would be lost in the translation (not dropped, just mapped into a value that means “no translation”).

If you don't use Unicode to translate between character sets, you'll have to define translation tables for every pair of encodings that you need. The CREATE CONVERSION command defines a conversion from one encoding to another. PostgreSQL provides a number of pre-defined conversions (114 in version 8.0.0).
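
If you're curious about the conversions already defined in your cluster, you can peek at the pg_conversion system catalog (it appeared along with CREATE CONVERSION in release 7.3). The following query is only a sketch, but it lists each conversion along with its source and destination encodings:

movies=# SELECT conname,
movies-#        pg_encoding_to_char( conforencoding ) AS source,
movies-#        pg_encoding_to_char( contoencoding )  AS destination
movies-#   FROM pg_conversion
movies-#  ORDER BY source, destination;

You should see one row for each pre-defined conversion (114 of them in version 8.0.0), plus a row for any conversion that you have created yourself.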

If you need to support more than one character set at your site, I would strongly encourage you to encode your data in Unicode. If you store mostly US-ASCII characters, UTF-8 will save you space compared to a fixed-width multibyte encoding. If all of the characters that you need to store are defined in a single-byte character set, use that set. If all of the characters that you need to store are defined by a fixed-width, multibyte character set, you need to choose between that character set and Unicode.

Enabling Multi-Byte Support

When you build an older version of PostgreSQL from source code, multibyte support is disabled by default. Unicode is a multibyte character set, so if you want to use Unicode, you need to enable multibyte support. Starting with PostgreSQL release 7.3, multibyte support is enabled by default. If you are using a version earlier than 7.3, you enable multibyte support by including the --enable-multibyte option when you run configure:

./configure --enable-multibyte

If you did not compile your own copy of PostgreSQL, the easiest way to determine whether it was compiled with multi-byte support is to invoke psql, as follows:

$ psql -l
      List of databases
   Name      | Owner | Encoding  
-------------+-------+-----------
 movies      | bruce | SQL_ASCII
 secondbooks | bruce | UNICODE

The -l flag lists all databases in a cluster. If you see three columns, multi-byte support is enabled. If the Encoding column is missing, you don't have multi-byte support.

Selecting an Encoding

There are four ways to select the encoding that you want to use for a particular database.

When you create a database using the createdb utility or the CREATE DATABASE command, you can choose an encoding for the new database. The following four commands are equivalent:

$ createdb -E latin5 my_turkish_db
$ createdb --encoding=latin5 my_turkish_db

movies=# CREATE DATABASE my_turkish_db WITH ENCODING 'LATIN5';
movies=# CREATE DATABASE my_turkish_db WITH ENCODING 12;
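
Whichever form you use, you can verify the result by querying the pg_database system table; a database's encoding is stored there as a number, so pg_encoding_to_char() comes in handy again. A quick sanity check (assuming you created my_turkish_db as shown above) might look like this:

movies=# SELECT datname, pg_encoding_to_char( encoding )
movies-#   FROM pg_database
movies-#  WHERE datname = 'my_turkish_db';
    datname    | pg_encoding_to_char
---------------+---------------------
 my_turkish_db | LATIN5
(1 row)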

If you don't specify an encoding with createdb (or CREATE DATABASE), the cluster's default encoding is used. You specify the default encoding for a cluster when you create the cluster using the initdb command:

$ initdb -E EUC_KR
$ initdb --encoding=EUC_KR

If you do not specify an encoding when you create the database cluster, initdb uses the encoding specified when you configured the PostgreSQL source code:

./configure --enable-multibyte=unicode

Finally, if you don't include an encoding name when you configure the PostgreSQL source code, SQL_ASCII is assumed.

So, if you don't do anything special, your databases are created with the SQL_ASCII encoding, and all character values are assumed to be expressed in SQL_ASCII.

If you enable multi-byte encodings, all encodings are available. The encoding name that you include in the --enable-multibyte flag selects the default encoding; it does not limit the available encodings.

Client/Server Translation

You now know that the PostgreSQL server can deal with encodings other than SQL_ASCII, but what about PostgreSQL clients? That question is difficult to answer. The pgAdmin and pgAdmin II clients do not support multibyte encodings, and neither does pgAccess. The psql client supports multi-byte encodings, but finding a font that can display all required characters is not easy.

Assuming that you are using a client application that supports encodings other than SQL_ASCII, you can select a client encoding with the SET CLIENT_ENCODING command:

movies=# SET CLIENT_ENCODING TO UNICODE;
SET

You can see which encoding has been selected for the client using the SHOW CLIENT_ENCODING command:

movies=# SHOW CLIENT_ENCODING;
NOTICE:  Current client encoding is 'UNICODE'
SHOW VARIABLE
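
If you're working in psql, the \encoding meta-command offers a shortcut: invoked with no argument it displays the current client encoding, and invoked with an encoding name it changes the client encoding for your session. A quick sketch:

movies=# \encoding
UNICODE
movies=# \encoding LATIN1
movies=# \encoding
LATIN1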

You can also view the server's encoding (but you can't change it):

movies=# SHOW SERVER_ENCODING;
NOTICE:  Current server encoding is 'UNICODE'
SHOW VARIABLE
movies=# SET SERVER_ENCODING TO BIG5;
NOTICE:  SET SERVER_ENCODING is not supported
SET VARIABLE 

If the CLIENT_ENCODING and SERVER_ENCODING are different, PostgreSQL will convert between the two encodings. In many cases, though, the translation will fail because a character in one encoding has no equivalent in the other. Let's say that you use a multi-byte-enabled client to INSERT some Katakana (that is, Japanese) text, as shown in Figure 22.1.

Figure 22.1. A Unicode-enabled client application.


This application (the Conjectrix Workstation) understands how to work with Unicode data. If you try to read this data with a different client encoding, you probably won't be happy with the results:

$ psql -q -d movies
movies=# SELECT tape_id, title FROM tapes WHERE tape_id = 'JP-35872';
tape_id   |                title
----------+----------------------------------------------------------
 JP-35872 | (bb)(bf)(e5)(a4)(a9)(e7)(a9)(ba)(e3)(81)(ae)(e5)(9f)(8e)...
(1 row)

The values that you see in psql have been translated into the SQL_ASCII encoding scheme. Some characters in the title column can be translated from Unicode into SQL_ASCII, but most cannot. The SQL_ASCII encoding does not include Katakana characters, so PostgreSQL has given you the hexadecimal values of the Unicode characters instead.
