In this book, the concepts character sets and collating sequence have been mentioned a few times. What these concepts exactly mean and how SQL handles them is the topic of this chapter.
In the various SQL products, support for character sets and collating sequences has been implemented very differently. For that reason, we describe the form implemented in MySQL.
To store alphanumeric values, characters such as A, b, and ë, but also special symbols such as {, &, and =, they must be converted into numeric codes. Something like a translation table must be built that contains a unique numeric code for each relevant character. Each character, therefore, gets a position in that translation table. Such a translation table is called a character set in the SQL world. In the literature, the terms code character set and character encoding are sometimes used as well.
For a character set, an encoding scheme must be invented. The character set indicates only that, for example, the uppercase letter A has the position or the code 41 and that the lowercase letter h has the position 68. But how will we store that in bytes? For each translation table, several encoding schemes can be invented; the more creative you are, the more schemes you can create. At first, you always think of a fixed number of bits and bytes for each character. So, for the storage of a character set consisting of a maximum of 256 characters, you can decide to reserve 8 bits for each character. But you could also decide to use flexible storage. For characters that occur frequently, you reserve, for example, 4 bits; for the others, you reserve 8 or 12.
Flexible lengths are also used in Morse code. In Morse, letters are represented by dots and dashes. However, not every letter has the same number of dots or dashes. The letter e, for example, is only one dash, whereas the c is built up from a dash, a dot, a dash and finally a dot—four symbols for one letter. You can also use such a solution as an encoding scheme for a character set.
So, the encoding scheme contains information about how positions such as 1, 100, and 1,000 are stored on hard disk or in internal memory.
In SQL the concepts of character sets and encoding scheme are seen as synonyms. A character set is a combination of a translation table with an encoding scheme.
Through the years, many character sets have been invented. The first standardized character set was ASCII (American Standard Code for Information Interchange), and its first version was defined in 1960 by ANSI. Another well-known character set is EBCDIC (Extended Binary Coded Decimal Interchange Code), invented by IBM. It has been the standard on the IBM mainframes for a long time.
With ASCII, the number of characters was limited to a maximum of 256 (28) characters. That used to be enough, but nowadays applications and their users require much more. Applications must be capable of handling special letters, such as β, Đ, Œ, and æ. Also, letters with all sorts of accents, such as ş, ũ, ų, and š must be processed. And then we have not even mentioned the languages in which other letters are used. Think about languages from the Middle East and the Far East. In short, 256 positions are no longer sufficient. It was time to come up with character sets that could be used to code thousands of different characters. Unicode (short for Universal Code) is one of the most used new character sets, but there are more that can hold large sets of characters.
Unicode has different encoding schemes, including UTF-8, UTF-16, and UTF-32. UTF stands for Unicode Transformation Format. These encoding schemes vary in the number of different characters they can handle and in the number of bytes they reserve for certain characters.
The concept of collating sequence deals with the sort order or the grouping of the characters. If numeric values are sorted or compared, it is always obvious how that must be done. The number 10 is smaller than the number 100, so 10 comes before 100 when sorted. Sorting and comparing alphanumeric values is not always that simple. If you have to place the words Monkey and monkey in alphabetical order, which one comes first, the spelling with the uppercase letter or the one with the lowercase letter? If you sort on the positions of the characters, the spelling with the uppercase letters will come first with character sets such as ASCII and Unicode. But is that what you want? And if that is what you want, does that mean that a user living in Georgia wants that as well? It becomes even more difficult when you want to sort the Dutch words scène, schaaf, and scepter. It is only on the third letter that these words are different. When you look at the ASCII codes for these three letters, scepter comes first, then schaaf, and finally scène sorted last. However, most users would like to see scepter and scène behind each other. But then the question is, which of these two comes first? To throw some light on this, the collating sequence has been added. For example, if a character set is assigned to a column, a collating sequence can be specified. For one character set, several collating sequences can be relevant. A collating sequence always belongs to only one character set.
During the installation of SQL, a number of character sets are introduced. This list can be retrieved by using a special SHOW
statement or by querying a catalog table.
Example 17.1. Show the available character sets.
SHOW CHARACTER SET
or
SELECT CHARACTER_SET_NAME, DESCRIPTION, DEFAULT_COLLATE_NAME, MAXLEN FROM INFORMATION_SCHEMA.CHARACTER_SETS
The result is:
CHARSET DESCRIPTION DEFAULT COLLATION MAXLEN ------- --------------------------- ----------------- ------ big5 Big5 Traditional Chinese big5_chinese_ci 2 dec8 DEC West European dec8_swedish_ci 1 cp850 DOS West European cp850_general_ci 1 hp8 HP West European hp8_english_ci 1 koi8r KOI8-R Relcom Russian koi8r_general_ci 1 latin1 ISO 8859-1 West European latin1_swedish_ci 1 latin2 ISO 8859-2 Central European latin2_general_ci 1 swe7 7bit Swedish swe7_swedish_ci 1 ascii US ASCII ascii_general_ci 1 : utf8 UTF-8 Unicode utf8_general_ci 3 ucs2 UCS-2 Unicode ucs2_general_ci 2 : cp932 SJIS for Windows Japanese cp932_japanese_ci 2 eucjpms UJIS for Windows Japanese eucjpms_japanese_ci 3
Explanation: The column on the right contains the name of the character set. This is the name that we use in other statements to indicate which character set must be applied. The second column contains a short description of each character set. The third column contains the default collating sequence of each character set. And on the complete right side you can find the maximum number of bytes that is reserved for a character. Note that this is 3 bytes for the last one.
In the SELECT
statement, the column names, not a *
, have been specified to make sure that the SELECT
statement presents the columns in the same order as the SHOW
statement.
All available collating sequences can be retrieved as well.
Example 17.2. Show the available collating sequences for the character set utf8.
SHOW COLLATION LIKE 'utf8%'
or
SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE 'utf8%'
The result is:
COLLATION CHARSET ID DEFAULT COMPILED SORTLEN ------------------ ------- --- ------- -------- ------- utf8_general_ci utf8 33 Yes Yes 1 utf8_bin utf8 83 Yes 1 utf8_unicode_ci utf8 192 Yes 8 utf8_icelandic_ci utf8 193 Yes 8 utf8_latvian_ci utf8 194 Yes 8 utf8_romanian_ci utf8 195 Yes 8 utf8_slovenian_ci utf8 196 Yes 8 utf8_polish_ci utf8 197 Yes 8 utf8_estonian_ci utf8 198 Yes 8 utf8_spanish_ci utf8 199 Yes 8 utf8_swedish_ci utf8 200 Yes 8 utf8_turkish_ci utf8 201 Yes 8 utf8_czech_ci utf8 202 Yes 8 utf8_danish_ci utf8 203 Yes 8 utf8_lithuanian_ci utf8 204 Yes 8 utf8_slovak_ci utf8 205 Yes 8 utf8_spanish2_ci utf8 206 Yes 8 utf8_roman_ci utf8 207 Yes 8 utf8_persian_ci utf8 208 Yes 8
Explanation: The column on the left contains the names of the collating sequences that we can use in SQL statements. The second column contains the name of the character set to which the collating sequence belongs. ID
contains a unique number of the sequence. The DEFAULT
column indicates whether the collating sequence is the default for this character set. The last two columns contain technical information.
Each alphanumeric column has a character set. When a table is created, a character set can explicitly be assigned to each column. For this, a data type option is used.
Example 17.3. Create a new table with two alphanumeric columns, and assign the character set ucs2 to both.
CREATE TABLE TABUCS2 (C1 CHAR(10) CHARACTER SET ucs2 NOT NULL PRIMARY KEY, C2 VARCHAR(10) CHARACTER SET ucs2)
Explanation: The character set is included as data type option and is, therefore, placed after the data type and in front of the null specification and primary key. The name of the character set may be entered in uppercase or lowercase letters. The name may also be specified as an alphanumeric literal. CHARACTER SET
may be abbreviated to CHAR SET
or CHARSET
.
Columns belonging to the same table can have different character sets. This can be useful for the registration of, for example, a company name in different languages.
If a character set has not explicitly been defined for a column, the default character set is used.
Example 17.4. Create a new table with two alphanumeric columns, do not assign a character set, and look in the catalog tables next to see what the default character set is.
CREATE TABLE TABDEFKARSET (C1 CHAR(10) NOT NULL, C2 VARCHAR(10)) SELECT COLUMN_NAME, CHARACTER_SET_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'TABDEFKARSET'
The result is:
COLUMN_NAME CHARACTER_SET_NAME ----------- ------------------ C1 latin1 C2 latin1
The default character set is latin 1 for both columns. But where exactly has that default been defined? A default character set can be defined on two levels, on the table and on the database level. With a so-called table option, a default character set can be defined for a table.
Example 17.5. Create a new table with two alphanumeric columns, and define utf8 as the default character set.
CREATE TABLE TABUTF8 (C1 CHAR(10) NOT NULL, C2 VARCHAR(10)) DEFAULT CHARACTER SET utf8 SELECT COLUMN_NAME, CHARACTER_SET_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'TABUTF8'
The result is:
COLUMN_NAME CHARACTER_SET_NAME ----------- ------------------ C1 utf8 C2 utf8
If no default character set has been defined for a table, SQL checks whether one has been defined on the database level.
Each created database has a default character set, which is latin 1 if nothing has been specified. In Chapter 22, “Creating Databases,” we show how this default character set can be specified and changed.
Character sets that have been assigned once explicitly do not change when you change the default of the table or database later.
Each column should also have a collating sequence. If it has not been specified, SQL uses the default collating sequence that belongs to the character set. The next example shows how such a default collating sequence of a character set can be retrieved.
Example 17.6. Get the collating sequences of the columns of the tables that were created in Examples 17.3 and 17.4.
SELECT TABLE_NAME, COLUMN_NAME, COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME IN ('TABUCS2', 'TABDEFKARSET')
The result is:
TABLE_NAME COLUMN_NAME COLLATION_NAME ------------ ----------- ----------------- tabdefkarset C1 latin1_swedish_ci tabdefkarset C2 latin1_swedish_ci tabucs2 C1 ucs2_general_ci tabucs2 C2 ucs2_general_ci
Of course, it is possible to specify explicitly a collating sequence with the data type option COLLATE
.
Example 17.7. Create a new table with two alphanumeric columns, define utf8 as the character set, and use two different collating sequences.
CREATE TABLE TABCOLLATE (C1 CHAR(10) CHARACTER SET utf8 COLLATE utf8_romanian_ci NOT NULL, C2 VARCHAR(10) CHARACTER SET utf8 COLLATE utf8_spanish_ci) SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'TABCOLLATE'
The result is:
COLUMN_NAME CHARACTER_SET_NAME COLLATION_NAME ----------- ------------------ ---------------- C1 utf8 utf8_romanian_ci C2 utf8 utf8_spanish_ci
Explanation. The name of the collating sequence may also be written in uppercase letters and may be placed between brackets. If a character set and a collating sequence are specified, the character set should go first.
If all alphanumeric columns of a table need to have the same collating sequence, a default collating sequence can be defined for the entire table. Even though the character sets have their own collating sequences, that of the table still has priority.
Example 17.8. Create a new table with two alphanumeric columns, and define utf8 as character set and utf8_romanian_ci as the collating sequence.
CREATE TABLE TABDEFCOL (C1 CHAR(10) NOT NULL, C2 VARCHAR(10)) CHARACTER SET = utf8 COLLATE = utf8_romanian_ci SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'TABDEFCOL'
The result is:
COLUMN_NAME CHARACTER_SET_NAME COLLATION_NAME ----------- ------------------ ---------------- C1 utf8 utf8_romanian_ci C2 utf8 utf8_romanian_ci
It is also possible to specify a default collating sequence on the database level; see Chapter 22.
The character set and the collating sequence play a big part in the processing of alphanumeric expressions. Especially when making comparisons and sorting data, SQL must include the character sets and collating sequence of the expressions concerned. It is not allowed to compare two alphanumeric values belonging to two different collating sequences. We can conclude that two expressions with two different character sets cannot be compared, either, because, by definition, they have different collating sequences.
Example 17.9. Create a new table with two columns based upon different character sets.
CREATE TABLE TWOCHARSETS (C1 CHAR(10) CHARACTER SET 'latin1' NOT NULL, C2 VARCHAR(10) CHARACTER SET 'utf8') INSERT INTO TWOCHARSETS VALUES ('A', 'A') SELECT * FROM TWOCHARSETS WHERE C1 = C2
SQL returns an error message when processing this SELECT
.
Example 17.10. Create a new table with two columns based upon the same character set, but with different collating sequences.
CREATE TABLE TWOCOLL (C1 CHAR(10) COLLATE 'latin1_general_ci' NOT NULL, C2 VARCHAR(10) COLLATE 'latin1_danish_ci') INSERT INTO TWOCOLL VALUES ('A', 'A') SELECT * FROM TWOCOLL WHERE C1 = C2
Explanation. Both the columns C1 and C2 have the character set latin 1 (the default of the database), but their collating sequences differ; as a result, comparisons, such as in the earlier SELECT
statement, will lead to error messages.
To compare two values with different collating sequences, you could change the collating sequence of one. Specify the term COLLATE
behind the column concerned then, followed by the name of the sequence:
SELECT * FROM TWOCOLL WHERE C1 COLLATE latin1_danish_ci = C2
Because of this, the definition of the alphanumeric expression is extended somewhat:
<alphanumeric expression> ::= <alphanumeric scalar expression> | <alphanumeric row expression> | <alphanumeric table expression> <alphanumeric scalar expression> ::= <singular alphanumeric scalar expression> COLLATE <name> | <compound alphanumeric scalar expression> <alphanumeric singular scalar expression> ::= _<collating sequence name> <alphanumeric literal> | <alphanumeric column specification> | <alphanumeric user variable> | <alphanumeric system variable> | <alphanumeric cast expression> | <alphanumeric case expression> | NULL | ( <alphanumeric scalar expression> ) | <alphanumeric scalar function> | <alphanumeric aggregation function> | <alphanumeric scalar subquery> |
It is obvious that you can specify only a collating sequence that belongs to the character set of the column or expression. The following statement also returns an error message because utf8_general_ci is not a collating sequence that belongs to latin1:
SELECT * FROM TWOCOLL WHERE C1 COLLATE utf8_general_ci = C2
What exactly is the character set of an alphanumeric literal? If nothing is specified, that is the default character set of the database. If you want to assign a literal another character set, you should place the name of the character set in front of the literal. And in front of that name, the underscore symbol must be placed.
To retrieve the collating sequence of a certain expression, the COLLATE function has been added.
Example 17.12. Get the collating sequence of the expressions _utf8'database'
, _utf8'database' COLLATE utf8_bin
, and of the NAME
column of the PLAYERS
table.
SELECT COLLATION(_utf8'database'), COLLATION(_utf8'database' COLLATE utf8_bin), COLLATION((SELECT MAX(NAME) FROM PLAYERS))
The result is:
COLLATION(_utf8'database') COLLATION(...) COLLATION(...) -------------------------- -------------- -------------- utf8_general_ci utf8_bin latin1_swedish_ci
With the CHARSET
function, you retrieve the character set.
Example 17.13. Get the character sets of the expression _utf8'database'
and of the NAME
column of the PLAYERS
table.
COLLATE
may also be used in ORDER BY
clauses to specify a sorting on another collating sequence.
Example 17.14. Sort the two names Muller and Müller with two different collating sequences: latin1_swedish_ci and latin1_german2_ci.
SELECT _latin1 'Muller' AS NAME UNION SELECT CONCAT('M', _latin1 x'FC', 'ller') ORDER BY NAME COLLATE latin1_swedish_ci
The result is:
NAME ------ Muller Müller
Explanation. The first select block returns the name Muller with the character set latin 1; the second select block returns the name Müller. When you change the collating sequence into latin1_german2_ci in this statement, the two rows are turned around in sequence, as the following result shows:
NAME ------ Müller Muller
For the grouping of data, a check is done to see whether values in a column are equal. If that is the case, they are joined in one group. If the column contains alphanumeric values, the collating sequence plays a big part. In one collating sequence, two different characters can be seen as equal, while in another sequence they are considered to be unequal.
Example 17.15. Create a table in which the characters e, é, and ë are stored.
CREATE TABLE LETTERS (SEQNO INTEGER NOT NULL PRIMARY KEY, LETTER CHAR(1) CHARACTER SET UTF8 NOT NULL) INSERT INTO LETTERS VALUES (1, 'e'), (2, x'E9'),(3, x'EB') SELECT LETTER FROM (SELECT LETTER COLLATE utf8_general_ci AS LETTER FROM LETTERS) AS L GROUP BY LETTER
The result is:
LETTER ------ e
Explanation. The hexadecimal code of é is E9 and ë becomes EB. In the subquery, all letters are converted into the utf8_general_ci collating sequence. The result shows that é and ë are considered to be equal, while a separate group is formed for the character e. If you change the collating sequence, you get another result:
SELECT LETTER FROM (SELECT LETTER COLLATE utf8_swedish_ci AS L FROM LETTERS) AS LETTER GROUP BY LETTER
The result is:
LETTER ------ e
Now all three characters form one group and can be joined together. So be careful when you group and sort alphanumeric values when collating sequences are involved.
For many expressions and statements, SQL can decide for itself which collating sequence must be used. For example, if you sort the values of a column or you compare a column with itself, the collating sequence of the relevant column will be used; see the following example.
Example 17.16. Use the LETTERS
table from Example 17.15 and sort this table on the LETTER
column.
SELECT LETTER FROM LETTERS ORDER BY LETTER
Which collating sequence will be used if you compare values that are of the same character set but that have different collating sequences? SQL solves this problem by means of the coercibility. Each expression has a coercibility value between 0 and 5. If two expressions are compared with different coercibility values, the collating sequence of the expression with the lowest coercibility value is selected. The rules for coercibility follow:
If an explicit collating sequence is assigned to an expression, the coercibility is equal to 0
.
The concatenation of two alphanumeric expressions with different collating sequences gives a coercibility that is equal to 1
.
The coercibility of a column specification is 2
.
The value of functions such as USER()
and VERSION()
has a coercibility of 3
.
The coercibility of an alphanumeric literal is 4
.
The NULL
value of an expression that has NULL
as result has 5
as coercibility.
For the comparison COLUMN1 = 'e'
, the column specification COLUMN1
has a coercibility of 2
and that the literal has a coercibility of 4
. This implies that SQL will use the collating sequence of the column specification.
You can retrieve the coercibility of an expression with the COERCIBILITY
function.
Example 17.17. Get the coercibility value of several expressions.
SELECT COERCIBILITY('Rick' COLLATE latin1_general_ci) AS C0, COERCIBILITY(TEAMNO) AS C2, COERCIBILITY(USER()) AS C3, COERCIBILITY('Rick') AS C4, COERCIBILITY(NULL) AS C5 FROM TEAMS WHERE TEAMNO = 1
The result is:
C0 C2 C3 C4 C5 -- -- -- -- -- 0 2 3 4 5
Various system variables have a relationship with character sets and collating sequences. Table 17.1 contains their names and the corresponding explanations.
Table 17.1. System Variables for Character Sets and Collating Sequences
System Variable | Explanation |
---|---|
| The character set of the statements that are sent from the client to the server. |
| The character set of the client/server connection. |
| The default character set of the current database. The value of this variable can change every time the |
| The character set of the end results of |
| The default character set of the server. |
| The character set of the system. This character set is used for the names of database objects, such as tables and columns, but also for the names of functions that are stored in the catalog tables. The value of this variable is always equal to |
| The name of the directory in which the files with all the character sets are registered. |
| The character set of the present connection. |
| The default collating sequence of the current database. The value of this variable can change every time the |
| The default collating sequence of the server. |
Besides CHARACTER_SET_DIR
, the value of each of these system variables can be retrieved with the help of “at” symbols within SQL statements.
Example 17.18. Give the value of the default collating sequence of the current database.
SELECT @@COLLATION_DATABASE
The result is:
@@COLLATION_DATABASE -------------------- latin1_swedish_ci
Example 17.19. Give the values of the system variables whose name begins with CHARACTER_SET
.
SHOW VARIABLES LIKE 'CHARACTER_SET%'
The result is:
VARIABLE_NAME VALUE ———————————— —————————————————- character_set_client latin1 character_set_connection latin1 character_set_database latin1 character_set_results latin1 character_set_server latin1 character_set_system utf8 character_sets_dir C:Program FilesMySQLMySQL Server 5.0sharecharsets/
18.222.196.175