APPENDIX A

What is UTF?

UTF is an abbreviation for UCS Transformation Format, and UCS is an abbreviation for Universal Character Set. The Universal Character Set is kept synchronized with the Unicode standard. There are three commonly used UTF encodings, namely UTF-8, UTF-16 and UTF-32.

UTF-8 encodes Unicode characters into a sequence of 8-bit values known as code units; in other words, the code unit in UTF-8 is 8 bits long. Similarly, UTF-16 and UTF-32 use 16-bit and 32-bit code units respectively for encoding Unicode characters.

The Unicode code space contains over a million code points (v5.2.0 of the Unicode Standard is current at the time of writing this book). The valid range of code points for Unicode characters is from 0 to 10FFFF (hex). Within this range, the values from D800 to DFFF are reserved for creating surrogate pairs and are not assigned to any abstract characters. The range D800–DBFF is for high surrogates and DC00–DFFF is for low surrogates. Surrogates are used for encoding supplementary characters in UTF-16, as discussed later in this appendix. Let us first look at how UTF-8 encoding is done.

In UTF-8 encoding, a single Unicode character is encoded into one to four octets depending on the value of the code point being encoded. The following table shows the number of bytes (code units) used for encoding characters in the different code point ranges:

Code point values (hex)     No. of code units (bytes)
0–7F                        1
80–7FF                      2
800–FFFF                    3
10000–10FFFF                4
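
This range check can be written down directly. The following sketch is only an illustration (the helper name utf8Length is made up for this appendix, not a library method); it returns the number of UTF-8 code units needed for a given code point, following the table above:

// Sketch: number of UTF-8 bytes needed for a Unicode code point.
// The ranges follow the table above.
static int utf8Length(int codePoint) {
    if (codePoint < 0 || codePoint > 0x10FFFF)
        throw new IllegalArgumentException("Not a valid code point: " + codePoint);
    if (codePoint <= 0x7F)   return 1;   // 0 - 7F
    if (codePoint <= 0x7FF)  return 2;   // 80 - 7FF
    if (codePoint <= 0xFFFF) return 3;   // 800 - FFFF
    return 4;                            // 10000 - 10FFFF
}

For example, utf8Length(0x20AC), the euro sign, returns 3.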

For 7-bit values (0–7F), the code point is stored in a single byte: a 0 bit followed by the 7 bits of the code point value. The 7 bits are encoded as shown below:

1st byte = 0xxxxxxx where xxxxxxx are the 7 bits to be encoded.

For 11-bit values (80–7FF), the code point is encoded into 2 bytes. The first byte has its most significant 3 bits set to 110, followed by the most significant 5 bits of the code point value; the second byte has its most significant 2 bits set to 10, followed by the least significant 6 bits of the 11-bit code point value. The 110 in the first byte indicates that it is the first byte of a 2-byte encoding, and the 10 in the second byte indicates that it is a continuation byte. The 11 bits are thus encoded as follows:

1st byte = 110mmmmm where mmmmm are the most significant 5 bits (bits 10–6) of the 11 bits to be encoded,
2nd byte = 10nnnnnn where nnnnnn are the remaining 6 bits (bits 5–0) to be encoded.

For 16-bit values (800–FFFF), the code point is encoded into 3 bytes. The first byte has its most significant 4 bits set to 1110, followed by the most significant 4 bits of the code point value; the second byte has its most significant 2 bits set to 10, followed by the next 6 bits of the 16-bit code point value; and the third byte has its most significant 2 bits set to 10, followed by the least significant 6 bits of the 16-bit code point value. The 1110 in the first byte indicates that it is the first byte of a 3-byte encoding, and the next two bytes have their most significant 2 bits set to 10 to indicate that they are continuation bytes. The 16 bits are thus encoded as follows:

1st byte = 1110wwww where wwww are the most significant 4 bits (bits 15–12) of the 16 bits to be encoded,
2nd byte = 10xxxxxx where xxxxxx are the next 6 bits (bits 11–6) to be encoded,
3rd byte = 10yyyyyy where yyyyyy are the remaining 6 bits (bits 5–0) to be encoded.

For 21-bit values (10000–10FFFF), the code point is encoded into 4 bytes. The first byte has its most significant 5 bits set to 11110, followed by the most significant 3 bits of the code point value; the second byte has its most significant 2 bits set to 10, followed by the next 6 bits of the 21-bit code point value; the third byte has its most significant 2 bits set to 10, followed by the next 6 bits; and the fourth byte has 10 followed by the least significant 6 bits of the 21-bit code point value. The 11110 in the first byte indicates that it is the first byte of a 4-byte encoding, and the next three bytes have their most significant 2 bits set to 10 to indicate that they are continuation bytes. The 21 bits are thus encoded as follows:

1st byte = 11110www where www are the most significant 3 bits (bits 20–18) of the 21 bits to be encoded,
2nd byte = 10xxxxxx where xxxxxx are the next 6 bits (bits 17–12) to be encoded,
3rd byte = 10yyyyyy where yyyyyy are the next 6 bits (bits 11–6) to be encoded,
4th byte = 10zzzzzz where zzzzzz are the remaining 6 bits (bits 5–0) to be encoded.
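
The four cases above can be combined into a single routine. The sketch below (the method name encodeUtf8 is made up for this appendix) simply applies the bit patterns just described; Java's standard charset machinery performs this work for you, so the code is purely illustrative:

// Sketch: encode a single code point into standard UTF-8 bytes,
// following the bit patterns described above.
static byte[] encodeUtf8(int cp) {
    if (cp <= 0x7F) {                          // 1 byte:  0xxxxxxx
        return new byte[] { (byte) cp };
    } else if (cp <= 0x7FF) {                  // 2 bytes: 110mmmmm 10nnnnnn
        return new byte[] {
            (byte) (0xC0 | (cp >> 6)),
            (byte) (0x80 | (cp & 0x3F)) };
    } else if (cp <= 0xFFFF) {                 // 3 bytes: 1110wwww 10xxxxxx 10yyyyyy
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    } else {                                   // 4 bytes: 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }
}

For example, encodeUtf8(0x20AC), the euro sign, produces the three bytes E2 82 AC.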

This shows how Unicode characters are encoded using the standard UTF-8 encoding.

There is a variation of the standard UTF-8 encoding called modified UTF-8. This variation is used in Java by the writeUTF and readUTF methods of the DataOutputStream, DataInputStream and RandomAccessFile classes. In this variation, the character zero is encoded into 2 bytes using the 11-bit encoding, which results in the 2 bytes 11000000 10000000 (C0 80 hex). This ensures that a byte with all bits set to 0 (the null byte) never appears in the encoded data. Another change concerns the values in the fourth range. A supplementary character is represented in Java as 2 char values (a high surrogate followed by a low surrogate), and each surrogate is encoded separately using the 3-byte form. As a result each supplementary character is encoded as 6 bytes (3 bytes for the high surrogate and 3 bytes for the low surrogate), rather than the 4 bytes of standard UTF-8.
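
The effect of this variation can be observed with DataOutputStream.writeUTF. The small program below is a sketch; note that writeUTF always writes a 2-byte length before the encoded data, so those two bytes appear first in the output:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeUTF("\0");    // a string containing a single NUL (U+0000) character
        out.flush();
        // writeUTF writes a 2-byte length followed by the modified UTF-8 data,
        // so the output here is: 00 02 C0 80
        for (byte b : buffer.toByteArray()) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}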

Let us now look at the UTF-16 encoding.

In UTF-16 encoding, Unicode characters are encoded as either one or two 16-bit code units depending on the value of the code point. We know that the range of code points for Unicode characters is from 0 to 10FFFF (hex). The characters in the range from 0 to FFFF (hex) are encoded into a single 16-bit code unit that holds the value of the code point as it is. The characters in this range make up the BMP (Basic Multilingual Plane). The supplementary characters, which are in the range from 10000 (hex) to 10FFFF (hex), are encoded into two 16-bit code units as explained below.
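
In Java a char holds exactly one UTF-16 code unit, so a BMP character occupies one char while a supplementary character occupies two. A small illustration using the standard Character class (the code points chosen are arbitrary examples):

// U+20AC (euro sign) lies in the BMP: one UTF-16 code unit.
// U+1D11E (musical symbol G clef) is supplementary: two code units.
System.out.println(Character.charCount(0x20AC));       // prints 1
System.out.println(Character.charCount(0x1D11E));      // prints 2
System.out.println(Character.toChars(0x1D11E).length); // prints 2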

To encode any code point in the range 10000–10FFFF (hex), the first step is to subtract 10000 (hex) from the value of the code point. The resulting value is in the range from 0 to FFFFF (hex), which is a 20-bit number. These 20 bits are then encoded into two 16-bit code units as follows.

The first 16-bit code unit has its most significant 6 bits set to 110110, followed by the most significant 10 bits of the 20-bit value obtained by subtracting 10000 (hex) from the original code point; the second 16-bit code unit has its most significant 6 bits set to 110111, followed by the least significant 10 bits of that 20-bit value. So the 20 bits are encoded as shown below:

1st 16-bit code unit = 110110xxxxxxxxxx where xxxxxxxxxx are the most significant 10 bits (bits 19–10) of the 20 bits to be encoded,
2nd 16-bit code unit = 110111yyyyyyyyyy where yyyyyyyyyy are the remaining 10 bits (bits 9–0) of the 20 bits to be encoded.

So the first code unit will be in the range from 1101100000000000 to 1101101111111111 (D800–DBFF), which is the range for high surrogates, and the second code unit will be in the range from 1101110000000000 to 1101111111111111 (DC00–DFFF), which is the range for low surrogates.
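
This computation can be expressed directly in code. The sketch below (the method name toSurrogatePair is made up for this appendix) performs the subtraction and bit splitting described above; the standard method Character.toChars does the same job and can be used to cross-check the result:

// Sketch: encode a supplementary code point (10000 - 10FFFF hex)
// as a UTF-16 surrogate pair.
static char[] toSurrogatePair(int codePoint) {
    int v = codePoint - 0x10000;                // 20-bit value in the range 0 - FFFFF
    char high = (char) (0xD800 | (v >> 10));    // 110110 followed by the top 10 bits
    char low  = (char) (0xDC00 | (v & 0x3FF));  // 110111 followed by the bottom 10 bits
    return new char[] { high, low };
}

For example, toSurrogatePair(0x1D11E) returns the pair D834 DD1E, which matches Character.toChars(0x1D11E).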

This shows how Unicode characters are encoded using the UTF-16 encoding.
