Chapter 17. Character Sets and Collating Sequences

Introduction

In this book, the concepts character sets and collating sequence have been mentioned a few times. What these concepts exactly mean and how SQL handles them is the topic of this chapter.

Portability

In the various SQL products, support for character sets and collating sequences has been implemented very differently. For that reason, we describe the form implemented in MySQL.

To store alphanumeric values, characters such as A, b, and ë, but also special symbols such as {, &, and =, they must be converted into numeric codes. Something like a translation table must be built that contains a unique numeric code for each relevant character. Each character, therefore, gets a position in that translation table. Such a translation table is called a character set in the SQL world. In the literature, the terms code character set and character encoding are sometimes used as well.

For a character set, an encoding scheme must be invented. The character set indicates only that, for example, the uppercase letter A has the position or the code 41 and that the lowercase letter h has the position 68. But how will we store that in bytes? For each translation table, several encoding schemes can be invented; the more creative you are, the more schemes you can create. At first, you always think of a fixed number of bits and bytes for each character. So, for the storage of a character set consisting of a maximum of 256 characters, you can decide to reserve 8 bits for each character. But you could also decide to use flexible storage. For characters that occur frequently, you reserve, for example, 4 bits; for the others, you reserve 8 or 12.

Flexible lengths are also used in Morse code. In Morse, letters are represented by dots and dashes. However, not every letter has the same number of dots or dashes. The letter e, for example, is only one dash, whereas the c is built up from a dash, a dot, a dash and finally a dot—four symbols for one letter. You can also use such a solution as an encoding scheme for a character set.

So, the encoding scheme contains information about how positions such as 1, 100, and 1,000 are stored on hard disk or in internal memory.

In SQL the concepts of character sets and encoding scheme are seen as synonyms. A character set is a combination of a translation table with an encoding scheme.

Through the years, many character sets have been invented. The first standardized character set was ASCII (American Standard Code for Information Interchange), and its first version was defined in 1960 by ANSI. Another well-known character set is EBCDIC (Extended Binary Coded Decimal Interchange Code), invented by IBM. It has been the standard on the IBM mainframes for a long time.

With ASCII, the number of characters was limited to a maximum of 256 (28) characters. That used to be enough, but nowadays applications and their users require much more. Applications must be capable of handling special letters, such as β, Đ, Œ, and æ. Also, letters with all sorts of accents, such as ş, ũ, ų, and š must be processed. And then we have not even mentioned the languages in which other letters are used. Think about languages from the Middle East and the Far East. In short, 256 positions are no longer sufficient. It was time to come up with character sets that could be used to code thousands of different characters. Unicode (short for Universal Code) is one of the most used new character sets, but there are more that can hold large sets of characters.

Unicode has different encoding schemes, including UTF-8, UTF-16, and UTF-32. UTF stands for Unicode Transformation Format. These encoding schemes vary in the number of different characters they can handle and in the number of bytes they reserve for certain characters.

The concept of collating sequence deals with the sort order or the grouping of the characters. If numeric values are sorted or compared, it is always obvious how that must be done. The number 10 is smaller than the number 100, so 10 comes before 100 when sorted. Sorting and comparing alphanumeric values is not always that simple. If you have to place the words Monkey and monkey in alphabetical order, which one comes first, the spelling with the uppercase letter or the one with the lowercase letter? If you sort on the positions of the characters, the spelling with the uppercase letters will come first with character sets such as ASCII and Unicode. But is that what you want? And if that is what you want, does that mean that a user living in Georgia wants that as well? It becomes even more difficult when you want to sort the Dutch words scène, schaaf, and scepter. It is only on the third letter that these words are different. When you look at the ASCII codes for these three letters, scepter comes first, then schaaf, and finally scène sorted last. However, most users would like to see scepter and scène behind each other. But then the question is, which of these two comes first? To throw some light on this, the collating sequence has been added. For example, if a character set is assigned to a column, a collating sequence can be specified. For one character set, several collating sequences can be relevant. A collating sequence always belongs to only one character set.

Available Character Sets and Collating Sequences

During the installation of SQL, a number of character sets are introduced. This list can be retrieved by using a special SHOW statement or by querying a catalog table.

Example 17.1. Show the available character sets.

SHOW CHARACTER SET

or

SELECT   CHARACTER_SET_NAME, DESCRIPTION,
         DEFAULT_COLLATE_NAME, MAXLEN
FROM     INFORMATION_SCHEMA.CHARACTER_SETS

The result is:

CHARSET  DESCRIPTION                  DEFAULT COLLATION  MAXLEN
-------  ---------------------------  -----------------  ------
big5     Big5 Traditional Chinese     big5_chinese_ci         2
dec8     DEC West European            dec8_swedish_ci         1
cp850    DOS West European            cp850_general_ci        1
hp8      HP West European             hp8_english_ci          1
koi8r    KOI8-R Relcom Russian        koi8r_general_ci        1
latin1   ISO 8859-1 West European     latin1_swedish_ci       1
latin2   ISO 8859-2 Central European  latin2_general_ci       1
swe7     7bit Swedish                 swe7_swedish_ci         1
ascii    US ASCII                     ascii_general_ci        1
:
utf8     UTF-8 Unicode                utf8_general_ci         3
ucs2     UCS-2 Unicode                ucs2_general_ci         2
:
cp932    SJIS for Windows Japanese    cp932_japanese_ci       2
eucjpms  UJIS for Windows Japanese    eucjpms_japanese_ci     3

Explanation: The column on the right contains the name of the character set. This is the name that we use in other statements to indicate which character set must be applied. The second column contains a short description of each character set. The third column contains the default collating sequence of each character set. And on the complete right side you can find the maximum number of bytes that is reserved for a character. Note that this is 3 bytes for the last one.

In the SELECT statement, the column names, not a *, have been specified to make sure that the SELECT statement presents the columns in the same order as the SHOW statement.

All available collating sequences can be retrieved as well.

Example 17.2. Show the available collating sequences for the character set utf8.

SHOW COLLATION LIKE 'utf8%'

or

SELECT    *
FROM      INFORMATION_SCHEMA.COLLATIONS
WHERE     COLLATION_NAME LIKE 'utf8%'

The result is:

COLLATION           CHARSET   ID  DEFAULT  COMPILED  SORTLEN
------------------  -------  ---  -------  --------  -------
utf8_general_ci     utf8      33  Yes      Yes             1
utf8_bin            utf8      83           Yes             1
utf8_unicode_ci     utf8     192           Yes             8
utf8_icelandic_ci   utf8     193           Yes             8
utf8_latvian_ci     utf8     194           Yes             8
utf8_romanian_ci    utf8     195           Yes             8
utf8_slovenian_ci   utf8     196           Yes             8
utf8_polish_ci      utf8     197           Yes             8
utf8_estonian_ci    utf8     198           Yes             8
utf8_spanish_ci     utf8     199           Yes             8
utf8_swedish_ci     utf8     200           Yes             8
utf8_turkish_ci     utf8     201           Yes             8
utf8_czech_ci       utf8     202           Yes             8
utf8_danish_ci      utf8     203           Yes             8
utf8_lithuanian_ci  utf8     204           Yes             8
utf8_slovak_ci      utf8     205           Yes             8
utf8_spanish2_ci    utf8     206           Yes             8
utf8_roman_ci       utf8     207           Yes             8
utf8_persian_ci     utf8     208           Yes             8

Explanation: The column on the left contains the names of the collating sequences that we can use in SQL statements. The second column contains the name of the character set to which the collating sequence belongs. ID contains a unique number of the sequence. The DEFAULT column indicates whether the collating sequence is the default for this character set. The last two columns contain technical information.

Assigning Character Sets to Columns

Each alphanumeric column has a character set. When a table is created, a character set can explicitly be assigned to each column. For this, a data type option is used.

Example 17.3. Create a new table with two alphanumeric columns, and assign the character set ucs2 to both.

CREATE TABLE TABUCS2
     (C1 CHAR(10)    CHARACTER SET ucs2 NOT NULL PRIMARY KEY,
      C2 VARCHAR(10) CHARACTER SET ucs2)

Explanation: The character set is included as data type option and is, therefore, placed after the data type and in front of the null specification and primary key. The name of the character set may be entered in uppercase or lowercase letters. The name may also be specified as an alphanumeric literal. CHARACTER SET may be abbreviated to CHAR SET or CHARSET.

Columns belonging to the same table can have different character sets. This can be useful for the registration of, for example, a company name in different languages.

If a character set has not explicitly been defined for a column, the default character set is used.

Example 17.4. Create a new table with two alphanumeric columns, do not assign a character set, and look in the catalog tables next to see what the default character set is.

CREATE TABLE TABDEFKARSET
      (C1 CHAR(10) NOT NULL,
       C2 VARCHAR(10))

SELECT   COLUMN_NAME, CHARACTER_SET_NAME
FROM     INFORMATION_SCHEMA.COLUMNS
WHERE    TABLE_NAME = 'TABDEFKARSET'

The result is:

COLUMN_NAME  CHARACTER_SET_NAME
-----------  ------------------
C1           latin1
C2           latin1

The default character set is latin 1 for both columns. But where exactly has that default been defined? A default character set can be defined on two levels, on the table and on the database level. With a so-called table option, a default character set can be defined for a table.

Example 17.5. Create a new table with two alphanumeric columns, and define utf8 as the default character set.

CREATE TABLE TABUTF8
       (C1 CHAR(10) NOT NULL,
        C2 VARCHAR(10))
   DEFAULT CHARACTER SET utf8

SELECT    COLUMN_NAME, CHARACTER_SET_NAME
FROM      INFORMATION_SCHEMA.COLUMNS
WHERE     TABLE_NAME = 'TABUTF8'

The result is:

COLUMN_NAME  CHARACTER_SET_NAME
-----------  ------------------
C1           utf8
C2           utf8

If no default character set has been defined for a table, SQL checks whether one has been defined on the database level.

Each created database has a default character set, which is latin 1 if nothing has been specified. In Chapter 22, “Creating Databases,” we show how this default character set can be specified and changed.

Character sets that have been assigned once explicitly do not change when you change the default of the table or database later.

Exercise 17.1:

Are the internal byte codes of two characters, belonging to the same character set but with different collating sequences, equal?

Exercise 17.2:

Show the SELECT statement with which the number of collating sequences for each character set can be determined.

Assigning Collating Sequences to Columns

Each column should also have a collating sequence. If it has not been specified, SQL uses the default collating sequence that belongs to the character set. The next example shows how such a default collating sequence of a character set can be retrieved.

Example 17.6. Get the collating sequences of the columns of the tables that were created in Examples 17.3 and 17.4.

SELECT   TABLE_NAME, COLUMN_NAME, COLLATION_NAME
FROM     INFORMATION_SCHEMA.COLUMNS
WHERE    TABLE_NAME IN ('TABUCS2', 'TABDEFKARSET')

The result is:

TABLE_NAME    COLUMN_NAME  COLLATION_NAME
------------  -----------  -----------------
tabdefkarset  C1           latin1_swedish_ci
tabdefkarset  C2           latin1_swedish_ci
tabucs2       C1           ucs2_general_ci
tabucs2       C2           ucs2_general_ci

Of course, it is possible to specify explicitly a collating sequence with the data type option COLLATE.

Example 17.7. Create a new table with two alphanumeric columns, define utf8 as the character set, and use two different collating sequences.

CREATE TABLE TABCOLLATE
       (C1 CHAR(10)
           CHARACTER SET utf8
           COLLATE utf8_romanian_ci NOT NULL,
        C2 VARCHAR(10)
           CHARACTER SET utf8
           COLLATE utf8_spanish_ci)

SELECT    COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM      INFORMATION_SCHEMA.COLUMNS
WHERE     TABLE_NAME = 'TABCOLLATE'

The result is:

COLUMN_NAME  CHARACTER_SET_NAME  COLLATION_NAME
-----------  ------------------  ----------------
C1           utf8                utf8_romanian_ci
C2           utf8                utf8_spanish_ci

ExplanationThe name of the collating sequence may also be written in uppercase letters and may be placed between brackets. If a character set and a collating sequence are specified, the character set should go first.

If all alphanumeric columns of a table need to have the same collating sequence, a default collating sequence can be defined for the entire table. Even though the character sets have their own collating sequences, that of the table still has priority.

Example 17.8. Create a new table with two alphanumeric columns, and define utf8 as character set and utf8_romanian_ci as the collating sequence.

CREATE TABLE TABDEFCOL
       (C1 CHAR(10) NOT NULL,
        C2 VARCHAR(10))
   CHARACTER SET = utf8
   COLLATE = utf8_romanian_ci

SELECT   COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM     INFORMATION_SCHEMA.COLUMNS
WHERE    TABLE_NAME = 'TABDEFCOL'

The result is:

COLUMN_NAME  CHARACTER_SET_NAME  COLLATION_NAME
-----------  ------------------  ----------------
C1           utf8                utf8_romanian_ci
C2           utf8                utf8_romanian_ci

It is also possible to specify a default collating sequence on the database level; see Chapter 22.

Expressions with Character Sets and Collating Sequences

The character set and the collating sequence play a big part in the processing of alphanumeric expressions. Especially when making comparisons and sorting data, SQL must include the character sets and collating sequence of the expressions concerned. It is not allowed to compare two alphanumeric values belonging to two different collating sequences. We can conclude that two expressions with two different character sets cannot be compared, either, because, by definition, they have different collating sequences.

Example 17.9. Create a new table with two columns based upon different character sets.

CREATE TABLE TWOCHARSETS
       (C1 CHAR(10) CHARACTER SET 'latin1' NOT NULL,
        C2 VARCHAR(10) CHARACTER SET 'utf8')

INSERT INTO TWOCHARSETS VALUES ('A', 'A')

SELECT   *
FROM     TWOCHARSETS
WHERE    C1 = C2

SQL returns an error message when processing this SELECT.

Example 17.10. Create a new table with two columns based upon the same character set, but with different collating sequences.

CREATE TABLE TWOCOLL
       (C1 CHAR(10) COLLATE 'latin1_general_ci' NOT NULL,
        C2 VARCHAR(10) COLLATE 'latin1_danish_ci')

INSERT INTO TWOCOLL VALUES ('A', 'A')

SELECT   *
FROM     TWOCOLL
WHERE    C1 = C2

ExplanationBoth the columns C1 and C2 have the character set latin 1 (the default of the database), but their collating sequences differ; as a result, comparisons, such as in the earlier SELECT statement, will lead to error messages.

To compare two values with different collating sequences, you could change the collating sequence of one. Specify the term COLLATE behind the column concerned then, followed by the name of the sequence:

SELECT   *
FROM     TWOCOLL
WHERE    C1 COLLATE latin1_danish_ci = C2

Because of this, the definition of the alphanumeric expression is extended somewhat:

<alphanumeric expression> ::=
   <alphanumeric scalar expression> |
   <alphanumeric row expression>    |
   <alphanumeric table expression>

<alphanumeric scalar expression> ::=
   <singular alphanumeric scalar expression>
Explanation COLLATE <name> |
   <compound alphanumeric scalar expression>

<alphanumeric singular scalar expression> ::=
   _<collating sequence name> <alphanumeric 
Explanationliteral>      |
   <alphanumeric column specification>            
Explanation         |
   <alphanumeric user variable>                   
Explanation         |
   <alphanumeric system variable>                 
Explanation         |
   <alphanumeric cast expression>                 
Explanation         |
   <alphanumeric case expression>                 
Explanation         |
   NULL                                           
Explanation         |
   ( <alphanumeric scalar expression> )           
Explanation         |
   <alphanumeric scalar function>                 
Explanation         |
   <alphanumeric aggregation function>            
Explanation         |
   <alphanumeric scalar subquery>

It is obvious that you can specify only a collating sequence that belongs to the character set of the column or expression. The following statement also returns an error message because utf8_general_ci is not a collating sequence that belongs to latin1:

SELECT   *
FROM     TWOCOLL
WHERE    C1 COLLATE utf8_general_ci = C2

What exactly is the character set of an alphanumeric literal? If nothing is specified, that is the default character set of the database. If you want to assign a literal another character set, you should place the name of the character set in front of the literal. And in front of that name, the underscore symbol must be placed.

Example 17.11. Present the word database in the utf8 character set.

SELECT _utf8'database'

To retrieve the collating sequence of a certain expression, the COLLATE function has been added.

Example 17.12. Get the collating sequence of the expressions _utf8'database', _utf8'database' COLLATE utf8_bin, and of the NAME column of the PLAYERS table.

SELECT   COLLATION(_utf8'database'),
         COLLATION(_utf8'database' COLLATE utf8_bin),
         COLLATION((SELECT MAX(NAME) FROM PLAYERS))

The result is:

COLLATION(_utf8'database')  COLLATION(...)  COLLATION(...)
--------------------------  --------------  --------------
utf8_general_ci             utf8_bin
latin1_swedish_ci

With the CHARSET function, you retrieve the character set.

Example 17.13. Get the character sets of the expression _utf8'database' and of the NAME column of the PLAYERS table.

SELECT    CHARSET(_utf8'database'),
          CHARSET((SELECT MAX(NAME) FROM PLAYERS))

The result is:

CHARSET(_utf8'database')  CHARSET((...))
------------------------  --------------
utf8                      latin1

Exercise 17.3:

How does a comparison look in which two alphanumeric expressions with different collating sequences are compared on the basis of a third collating sequence?

Sorting and Grouping with Collating Sequences

COLLATE may also be used in ORDER BY clauses to specify a sorting on another collating sequence.

Example 17.14. Sort the two names Muller and Müller with two different collating sequences: latin1_swedish_ci and latin1_german2_ci.

SELECT _latin1 'Muller' AS NAME
UNION
SELECT CONCAT('M', _latin1 x'FC', 'ller')
ORDER BY NAME COLLATE latin1_swedish_ci

The result is:

NAME
------
Muller
Müller

ExplanationThe first select block returns the name Muller with the character set latin 1; the second select block returns the name Müller. When you change the collating sequence into latin1_german2_ci in this statement, the two rows are turned around in sequence, as the following result shows:

NAME
------
Müller
Muller

For the grouping of data, a check is done to see whether values in a column are equal. If that is the case, they are joined in one group. If the column contains alphanumeric values, the collating sequence plays a big part. In one collating sequence, two different characters can be seen as equal, while in another sequence they are considered to be unequal.

Example 17.15. Create a table in which the characters e, é, and ë are stored.

CREATE TABLE LETTERS
   (SEQNO  INTEGER NOT NULL PRIMARY KEY,
    LETTER CHAR(1) CHARACTER SET UTF8 NOT NULL)

INSERT INTO LETTERS VALUES (1, 'e'), (2, x'E9'),(3, x'EB')

SELECT   LETTER
FROM     (SELECT  LETTER COLLATE utf8_general_ci AS LETTER
          FROM    LETTERS) AS L
GROUP BY LETTER

The result is:

LETTER
------
e

ExplanationThe hexadecimal code of é is E9 and ë becomes EB. In the subquery, all letters are converted into the utf8_general_ci collating sequence. The result shows that é and ë are considered to be equal, while a separate group is formed for the character e. If you change the collating sequence, you get another result:

SELECT   LETTER
FROM    (SELECT   LETTER COLLATE utf8_swedish_ci AS L
         FROM     LETTERS) AS LETTER
GROUP BY LETTER

The result is:

LETTER
------
e

Now all three characters form one group and can be joined together. So be careful when you group and sort alphanumeric values when collating sequences are involved.

Exercise 17.4:

Determine what the character set and collating sequence is of the TOWN column in the PLAYERS table.

Exercise 17.5:

Sort the players on the basis of the TOWN column, but use another collating sequence than the one in the previous exercise.

The Coercibility of Expressions

For many expressions and statements, SQL can decide for itself which collating sequence must be used. For example, if you sort the values of a column or you compare a column with itself, the collating sequence of the relevant column will be used; see the following example.

Example 17.16. Use the LETTERS table from Example 17.15 and sort this table on the LETTER column.

SELECT   LETTER
FROM     LETTERS
ORDER BY LETTER

Which collating sequence will be used if you compare values that are of the same character set but that have different collating sequences? SQL solves this problem by means of the coercibility. Each expression has a coercibility value between 0 and 5. If two expressions are compared with different coercibility values, the collating sequence of the expression with the lowest coercibility value is selected. The rules for coercibility follow:

  • If an explicit collating sequence is assigned to an expression, the coercibility is equal to 0.

  • The concatenation of two alphanumeric expressions with different collating sequences gives a coercibility that is equal to 1.

  • The coercibility of a column specification is 2.

  • The value of functions such as USER() and VERSION() has a coercibility of 3.

  • The coercibility of an alphanumeric literal is 4.

  • The NULL value of an expression that has NULL as result has 5 as coercibility.

For the comparison COLUMN1 = 'e', the column specification COLUMN1 has a coercibility of 2 and that the literal has a coercibility of 4. This implies that SQL will use the collating sequence of the column specification.

You can retrieve the coercibility of an expression with the COERCIBILITY function.

Example 17.17. Get the coercibility value of several expressions.

SELECT   COERCIBILITY('Rick' COLLATE latin1_general_ci) AS C0,
         COERCIBILITY(TEAMNO) AS C2,
         COERCIBILITY(USER()) AS C3,
         COERCIBILITY('Rick') AS C4,
         COERCIBILITY(NULL) AS C5
FROM     TEAMS
WHERE    TEAMNO = 1

The result is:

C0  C2  C3  C4  C5
--  --  --  --  --
 0   2   3   4   5

Related System Variables

Various system variables have a relationship with character sets and collating sequences. Table 17.1 contains their names and the corresponding explanations.

Table 17.1. System Variables for Character Sets and Collating Sequences

System Variable

Explanation

CHARACTER_SET_CLIENT

The character set of the statements that are sent from the client to the server.

CHARACTER_SET_CONNECTION

The character set of the client/server connection.

CHARACTER_SET_DATABASE

The default character set of the current database. The value of this variable can change every time the USE statement is used to “jump” to another database. If no current database exists, this variable has the value of the CHARACTER_SET_SERVER variable.

CHARACTER_SET_RESULTS

The character set of the end results of SELECT statements that are sent from the server to the client.

CHARACTER_SET_SERVER

The default character set of the server.

CHARACTER_SET_SYSTEM

The character set of the system. This character set is used for the names of database objects, such as tables and columns, but also for the names of functions that are stored in the catalog tables. The value of this variable is always equal to utf8.

CHARACTER_SET_DIR

The name of the directory in which the files with all the character sets are registered.

COLLATION_CONNECTION

The character set of the present connection.

COLLATION_DATABASE

The default collating sequence of the current database. The value of this variable can change every time the USE statement is used to “jump” to another database. If no current database exists, this variable has the value of the COLLATION_SERVER variable.

COLLATION_SERVER

The default collating sequence of the server.

Besides CHARACTER_SET_DIR, the value of each of these system variables can be retrieved with the help of “at” symbols within SQL statements.

Example 17.18. Give the value of the default collating sequence of the current database.

SELECT @@COLLATION_DATABASE

The result is:

@@COLLATION_DATABASE
--------------------
latin1_swedish_ci

Example 17.19. Give the values of the system variables whose name begins with CHARACTER_SET.

SHOW VARIABLES LIKE 'CHARACTER_SET%'

The result is:

VARIABLE_NAME              VALUE
————————————  —————————————————-
character_set_client       latin1
character_set_connection   latin1
character_set_database     latin1
character_set_results      latin1
character_set_server       latin1
character_set_system       utf8
character_sets_dir         C:Program FilesMySQLMySQL Server
                           5.0sharecharsets/

Answers

17.1

The internal byte codes are not equal then.

17.2

SELECT   COLLATION_NAME, COUNT(*)
FROM     INFORMATION_SCHEMA.COLLATIONS
GROUP BY COLLATION_NAME

17.3

EXPRESSION1 COLLATE utf8 = EXPRESSION2 COLLATE ut8

17.4

SELECT CHARSET((SELECT MAX(TOWN) FROM PLAYERS)),
       COLLATION((SELECT MAX(TOWN) FROM PLAYERS))

17.5

SELECT   TOWN
FROM     PLAYERS
ORDER BY TOWN COLLATE latin1_danish_ci
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.222.196.175