Chapter 4. Text versus Bytes

Humans use text. Computers speak bytes.1

Esther Nam and Travis Fischer, Character Encoding and Unicode in Python

Python 3 introduced a sharp distinction between strings of human text and sequences of raw bytes. Implicit conversion of byte sequences to Unicode text is a thing of the past. This chapter deals with Unicode strings, binary sequences, and the encodings used to convert between them.

Depending on your Python programming context, a deeper understanding of Unicode may or may not be of vital importance to you. In the end, most of the issues covered in this chapter do not affect programmers who deal only with ASCII text. But even if that is your case, there is no escaping the str versus byte divide. As a bonus, you’ll find that the specialized binary sequence types provide features that the “all-purpose” Python 2 str type does not have.

In this chapter, we will visit the following topics:

  • Characters, code points, and byte representations

  • Unique features of binary sequences: bytes, bytearray, and memoryview

  • Codecs for full Unicode and legacy character sets

  • Avoiding and dealing with encoding errors

  • Best practices when handling text files

  • The default encoding trap and standard I/O issues

  • Safe Unicode text comparisons with normalization

  • Utility functions for normalization, case folding, and brute-force diacritic removal

  • Proper sorting of Unicode text with locale and the PyUCA library

  • Character metadata in the Unicode database

  • Dual-mode APIs that handle str and bytes

  • Building emojis from character combinations

What’s new in this chapter

Support for Unicode in Python 3 has been comprehensive and stable for a while, so the biggest change in this chapter is a new section on emojis—not because of changes in Python, but because of the growing popularity of emojis and emoji combinations. Unicode 13, released in 2020, supports more than 3000 emojis, and many of them are built by combining Unicode characters. “Multi-character emojis” explains.

Also new in this 2nd edition is “Finding characters by name”, including source code for a utility for searching the Unicode database, a great way to find circled digits and smiling cats from the command line.

A minor change worth mentioning is the Unicode support on Windows, which is better and simpler since Python 3.6, as we’ll see in “Beware of Encoding Defaults”.

Let’s start with the not-so-new, but fundamental concepts of characters, code points, and bytes.

Character Issues

The concept of “string” is simple enough: a string is a sequence of characters. The problem lies in the definition of “character.”

In 2020, the best definition of “character” we have is a Unicode character. Accordingly, the items you get out of a Python 3 str are Unicode characters, just like the items of a unicode object in Python 2—and not the raw bytes you get from a Python 2 str.

The Unicode standard explicitly separates the identity of characters from specific byte representations:

  • The identity of a character—its code point—is a number from 0 to 1,114,111 (base 10), shown in the Unicode standard as 4 to 6 hex digits with a “U+” prefix, from U+0000 to U+10FFFF. For example, the code point for the letter A is U+0041, the Euro sign is U+20AC, and the musical symbol G clef is assigned to code point U+1D11E. About 12% of the valid code points have characters assigned to them in Unicode 12.1, the standard used in Python 3.8.

  • The actual bytes that represent a character depend on the encoding in use. An encoding is an algorithm that converts code points to byte sequences and vice versa. The code point for the letter A (U+0041) is encoded as the single byte \x41 in the UTF-8 encoding, or as the bytes \x41\x00 in UTF-16LE encoding. As another example, UTF-8 requires three bytes (\xe2\x82\xac) to encode the Euro sign (U+20AC), but in UTF-16LE the same code point is encoded as two bytes: \xac\x20.
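The separation between code points and byte representations can be checked directly in Python. The following is a minimal sketch, not part of the original examples; the variable names are chosen here for illustration only.

```python
# Code points are plain integers; an encoding decides which bytes represent them.
euro = '\u20ac'        # EURO SIGN, U+20AC
clef = '\U0001d11e'    # MUSICAL SYMBOL G CLEF, U+1D11E

# ord() returns the code point; format it in the "U+" notation of the standard.
code_points = [f'U+{ord(ch):04X}' for ch in ('A', euro, clef)]

# The same code point produces different bytes under different encodings.
a_utf8 = 'A'.encode('utf_8')
a_utf16le = 'A'.encode('utf_16le')
euro_utf8 = euro.encode('utf_8')
euro_utf16le = euro.encode('utf_16le')

print(code_points)
print(a_utf8, a_utf16le, euro_utf8, euro_utf16le)
```

Note that Python displays the byte \x20 as a space and \x41 as the letter A, because both are in the printable ASCII range.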

Converting from code points to bytes is encoding; converting from bytes to code points is decoding. See Example 4-1.

Example 4-1. Encoding and decoding
>>> s = 'café'
>>> len(s)  1
4
>>> b = s.encode('utf8')  2
>>> b
b'caf\xc3\xa9'  3
>>> len(b)  4
5
>>> b.decode('utf8')  5
'café'
1

The str 'café' has four Unicode characters.

2

Encode str to bytes using UTF-8 encoding.

3

bytes literals have a b prefix.

4

bytes b has five bytes (the code point for “é” is encoded as two bytes in UTF-8).

5

Decode bytes to str using UTF-8 encoding.

Tip

If you need a memory aid to help distinguish .decode() from .encode(), convince yourself that byte sequences can be cryptic machine core dumps while Unicode str objects are “human” text. Therefore, it makes sense that we decode bytes to str to get human-readable text, and we encode str to bytes for storage or transmission.

Although the Python 3 str is pretty much the Python 2 unicode type with a new name, the Python 3 bytes is not simply the old str renamed, and there is also the closely related bytearray type. So it is worthwhile to take a look at the binary sequence types before advancing to encoding/decoding issues.

Byte Essentials

The new binary sequence types are unlike the Python 2 str in many regards. The first thing to know is that there are two basic built-in types for binary sequences: the immutable bytes type introduced in Python 3 and the mutable bytearray, added in Python 2.6.

Each item in bytes or bytearray is an integer from 0 to 255, and not a one-character string like in the Python 2 str. However, a slice of a binary sequence always produces a binary sequence of the same type—including slices of length 1. See Example 4-2.

Example 4-2. A five-byte sequence as bytes and as bytearray
>>> cafe = bytes('café', encoding='utf_8')  1
>>> cafe
b'caf\xc3\xa9'
>>> cafe[0]  2
99
>>> cafe[:1]  3
b'c'
>>> cafe_arr = bytearray(cafe)
>>> cafe_arr  4
bytearray(b'caf\xc3\xa9')
>>> cafe_arr[-1:]  5
bytearray(b'\xa9')
1

bytes can be built from a str, given an encoding.

2

Each item is an integer in range(256).

3

Slices of bytes are also bytes—even slices of a single byte.

4

There is no literal syntax for bytearray: they are shown as bytearray() with a bytes literal as argument.

5

A slice of bytearray is also a bytearray.

Warning

The fact that my_bytes[0] retrieves an int but my_bytes[:1] returns a bytes object of length 1 may be surprising, and makes it harder to support both Python 2.7 and 3 in programs that deal with binary data. But it is consistent with many other languages and also with other Python sequence types—except for str, which is the only sequence type where s[0] == s[:1].

Although binary sequences are really sequences of integers, their literal notation reflects the fact that ASCII text is often embedded in them. Therefore, three different displays are used, depending on each byte value:

  • For bytes in the printable ASCII range—from space to ~—the ASCII character itself is used.

  • For bytes corresponding to tab, newline, carriage return, and \, the escape sequences \t, \n, \r, and \\ are used.

  • If both string delimiters ' and " appear in the byte sequence, the whole sequence is delimited by ', and any ' inside are escaped as \'.3

  • For other byte values, a hexadecimal escape sequence is used (e.g., \x00 is the null byte).

That is why in Example 4-2 you see b'caf\xc3\xa9': the first three bytes b'caf' are in the printable ASCII range, the last two are not.

Both bytes and bytearray support every str method except those that do formatting (format, format_map) and a few others that depend on Unicode data, including casefold, isdecimal, isidentifier, isnumeric, isprintable, and encode. This means that you can use familiar string methods like endswith, replace, strip, translate, upper, and dozens of others with binary sequences—only using bytes and not str arguments. In addition, the regular expression functions in the re module also work on binary sequences, if the regex is compiled from a binary sequence instead of a str. Since Python 3.5, the % operator works with binary sequences again.4
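As an illustration of the paragraph above, here is a small sketch that applies familiar str-like methods, a bytes regular expression, and % formatting to binary data. The sample request line is made up for this example.

```python
import re

# A made-up chunk of binary data containing embedded ASCII text.
data = b'  GET /index.html HTTP/1.1\r\n'

# str-like methods work, as long as the arguments are bytes, not str.
request_line = data.strip()
is_get = request_line.startswith(b'GET')
upper_path = request_line.split()[1].upper()

# A regex compiled from a bytes pattern matches bytes and returns bytes groups.
pattern = re.compile(rb'HTTP/(\d)\.(\d)')
version = pattern.search(request_line).groups()

# Since Python 3.5, the % operator works with bytes again.
header = b'Content-Length: %d' % 42

print(request_line, is_get, upper_path, version, header)
```

Passing a str argument to any of these methods, such as request_line.startswith('GET'), raises TypeError.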

Binary sequences have a class method that str doesn’t have, called fromhex, which builds a binary sequence by parsing pairs of hex digits optionally separated by spaces:

>>> bytes.fromhex('31 4B CE A9')
b'1K\xce\xa9'

The other ways of building bytes or bytearray instances are calling their constructors with:

  • A str and an encoding keyword argument.

  • An iterable providing items with values from 0 to 255.

  • An object that implements the buffer protocol (e.g., bytes, bytearray, memoryview, array.array); this copies the bytes from the source object to the newly created binary sequence.
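The three constructor forms listed above can be sketched as follows. This is only an illustration; the variable names are not from the original text.

```python
# 1. A str plus an encoding keyword argument.
from_str = bytes('café', encoding='utf_8')

# 2. An iterable of integers in range(256).
from_ints = bytes([99, 97, 102, 195, 169])

# 3. An object implementing the buffer protocol; the bytes are copied.
from_buffer = bytearray(from_ints)

# Because bytearray made a copy, changing it does not affect the source.
from_buffer[0] = ord('C')

print(from_str, from_ints, from_buffer)
```

An integer outside range(256) in the iterable, as in bytes([256]), raises ValueError.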

Warning

It is also possible to call bytes or bytearray with a single integer to create a binary sequence of that size initialized with null bytes. PEP 467 — Minor API improvements for binary sequences proposed deprecating this signature, but that proposal has not been adopted, and the call still works in current Python 3 releases.

Building a binary sequence from a buffer-like object is a low-level operation that may involve type casting. See a demonstration in Example 4-3.

Example 4-3. Initializing bytes from the raw data of an array
>>> import array
>>> numbers = array.array('h', [-2, -1, 0, 1, 2])  1
>>> octets = bytes(numbers)  2
>>> octets
b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'  3
1

Typecode 'h' creates an array of short integers (16 bits).

2

octets holds a copy of the bytes that make up numbers.

3

These are the 10 bytes that represent the five short integers.

Creating a bytes or bytearray object from any buffer-like source will always copy the bytes. In contrast, memoryview objects let you share memory between binary data structures. To read structured information in binary sequences, the struct module is invaluable. We’ll see it working along with bytes and memoryview in “Structs and Memory Views”.
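The difference between copying and sharing can be seen in a short sketch. It assumes nothing beyond the standard library; the names snapshot and view are chosen here for illustration.

```python
import array
import struct

numbers = array.array('h', [-2, -1, 0, 1, 2])

snapshot = bytes(numbers)     # an independent copy of the raw bytes
view = memoryview(numbers)    # shares memory with the array, no copy

numbers[2] = 100              # change the array after both were created

# The copy still holds the old data; struct reads it as five native shorts.
values_from_snapshot = struct.unpack('5h', snapshot)

# The memoryview reflects the change made through the array.
values_from_view = view.tolist()

view.release()                # release the buffer when done with it
print(values_from_snapshot, values_from_view)
```

The format '5h' uses the native byte order and size, which matches how array.array('h') stores its items on the same machine.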

After this basic exploration of binary sequence types in Python, let’s see how they are converted to/from strings.

Basic Encoders/Decoders

The Python distribution bundles more than 100 codecs (encoder/decoder) for text to byte conversion and vice versa. Each codec has a name, like 'utf_8', and often aliases, such as 'utf8', 'utf-8', and 'U8', which you can use as the encoding argument in functions like open(), str.encode(), bytes.decode(), and so on. Example 4-4 shows the same text encoded as three different byte sequences.

Example 4-4. The string “El Niño” encoded with three codecs producing very different byte sequences
>>> for codec in ['latin_1', 'utf_8', 'utf_16']:
...     print(codec, 'El Niño'.encode(codec), sep='\t')
...
latin_1 b'El Ni\xf1o'
utf_8   b'El Ni\xc3\xb1o'
utf_16  b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'
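To confirm that the codec aliases mentioned earlier all resolve to the same codec, the standard codecs.lookup function can be used. This is a small sketch added for illustration.

```python
import codecs

aliases = ['utf_8', 'utf8', 'utf-8', 'U8']

# codecs.lookup() returns a CodecInfo object; its .name is the canonical name.
canonical = {alias: codecs.lookup(alias).name for alias in aliases}

# Every alias produces the same bytes for the same text.
encoded = {alias: 'El Niño'.encode(alias) for alias in aliases}

print(canonical)
print(set(encoded.values()))
```

An unknown name, such as codecs.lookup('no_such_codec'), raises LookupError.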

Figure 4-1 demonstrates a variety of codecs generating bytes from characters like the letter “A” through the G-clef musical symbol. Note that the last three encodings are variable-length, multibyte encodings.

Encodings demonstration table
Figure 4-1. Twelve characters, their code points, and their byte representation (in hex) in seven different encodings (asterisks indicate that the character cannot be represented in that encoding)

All those asterisks in Figure 4-1 make clear that some encodings, like ASCII and even the multibyte GB2312, cannot represent every Unicode character. The UTF encodings, however, are designed to handle every Unicode code point.

The encodings shown in Figure 4-1 were chosen as a representative sample:

latin1 a.k.a. iso8859_1

Important because it is the basis for other encodings, such as cp1252 and Unicode itself (note how the latin1 byte values appear in the cp1252 bytes and even in the code points).

cp1252

A latin1 superset by Microsoft, adding useful symbols like curly quotes and the € (euro); some Windows apps call it “ANSI,” but it was never a real ANSI standard.

cp437

The original character set of the IBM PC, with box drawing characters. Incompatible with latin1, which appeared later.

gb2312

Legacy standard to encode the simplified Chinese ideographs used in mainland China; one of several widely deployed multibyte encodings for Asian languages.

utf-8

The most common 8-bit encoding on the Web, by far; as of January, 2020, [W3Techs: Usage of Character Encodings for Websites] claims that 94.7% of sites use UTF-8, up from 81.4% when I wrote this paragraph in the 1st edition of Fluent Python in September, 2014.

utf-16le

One form of the UTF 16-bit encoding scheme; all UTF-16 encodings support code points beyond U+FFFF through escape sequences called “surrogate pairs.”

Warning

UTF-16 superseded UCS-2, the original 16-bit Unicode 1.0 encoding, way back in 1996. UCS-2 has been deprecated since the last century because it only supports code points up to U+FFFF, yet it is still used in many systems. As of Unicode 12.1, more than 57% of the allocated code points are above U+FFFF, including the all-important emojis.
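To make surrogate pairs concrete, the following sketch computes the pair for the G clef (U+1D11E) by hand, using the arithmetic defined for UTF-16, and compares it with the output of Python's codec. The variable names are illustrative.

```python
import struct

clef = '\U0001d11e'                # MUSICAL SYMBOL G CLEF, above U+FFFF

# UTF-16 subtracts 0x10000, then splits the remaining 20 bits in two halves.
offset = ord(clef) - 0x10000
high = 0xD800 + (offset >> 10)     # high (leading) surrogate
low = 0xDC00 + (offset & 0x3FF)    # low (trailing) surrogate

# Pack both 16-bit units in little-endian order, as UTF-16LE does.
manual = struct.pack('<HH', high, low)
encoded = clef.encode('utf_16le')

print(hex(high), hex(low), manual == encoded)
```

A UCS-2 system has no notion of this pairing, which is why it cannot represent the character at all.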

With this overview of common encodings now complete, we move to handling issues in encoding and decoding operations.

Understanding Encode/Decode Problems

Although there is a generic UnicodeError exception, the error reported by Python is usually more specific: either a UnicodeEncodeError (when converting str to binary sequences) or a UnicodeDecodeError (when reading binary sequences into str). Loading Python modules may also raise SyntaxError when the source encoding is unexpected. We’ll show how to handle all of these errors in the next sections.

Tip

The first thing to note when you get a Unicode error is the exact type of the exception. Is it a UnicodeEncodeError, a UnicodeDecodeError, or some other error (e.g., SyntaxError) that mentions an encoding problem? To solve the problem, you have to understand it first.

Coping with UnicodeEncodeError

Most non-UTF codecs handle only a small subset of the Unicode characters. When converting text to bytes, if a character is not defined in the target encoding, UnicodeEncodeError will be raised, unless special handling is provided by passing an errors argument to the encoding method or function. The behavior of the error handlers is shown in Example 4-5.

Example 4-5. Encoding to bytes: success and error handling
>>> city = 'São Paulo'
>>> city.encode('utf_8')  1
b'S\xc3\xa3o Paulo'
>>> city.encode('utf_16')
b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'
>>> city.encode('iso8859_1')  2
b'S\xe3o Paulo'
>>> city.encode('cp437')  3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../lib/python3.4/encodings/cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in
position 1: character maps to <undefined>
>>> city.encode('cp437', errors='ignore')  4
b'So Paulo'
>>> city.encode('cp437', errors='replace')  5
b'S?o Paulo'
>>> city.encode('cp437', errors='xmlcharrefreplace')  6
b'S&#227;o Paulo'
1

The 'utf_?' encodings handle any str.

2

'iso8859_1' also works for the 'São Paulo' str.

3

'cp437' can’t encode the 'ã' (“a” with tilde). The default error handler—'strict'—raises UnicodeEncodeError.

4

The errors='ignore' handler silently skips characters that cannot be encoded; this is usually a very bad idea.

5

When encoding, errors='replace' substitutes unencodable characters with '?'; data is lost, but users will get a clue that something is amiss.

6

'xmlcharrefreplace' replaces unencodable characters with an XML numeric character reference.

Note

The codecs error handling is extensible. You may register extra strings for the errors argument by passing a name and an error handling function to the codecs.register_error function. See the codecs.register_error documentation.
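As a sketch of how such an extension might look, the following registers a custom handler under the hypothetical name 'hex_placeholder'. Both the handler name and the placeholder format are invented for this illustration; only codecs.register_error and the attributes of UnicodeEncodeError are standard.

```python
import codecs

def hex_placeholder(exc):
    """Replace each unencodable character with a visible <U+XXXX> marker."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc  # this sketch only handles encoding errors
    bad = exc.object[exc.start:exc.end]
    replacement = ''.join(f'<U+{ord(ch):04X}>' for ch in bad)
    # Return the replacement text and the position where encoding resumes.
    return replacement, exc.end

# 'hex_placeholder' is a hypothetical name chosen for this example.
codecs.register_error('hex_placeholder', hex_placeholder)

encoded = 'São Paulo'.encode('cp437', errors='hex_placeholder')
print(encoded)
```

The replacement string returned by the handler must itself be encodable in the target encoding; here it is pure ASCII, so that is not a concern.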

ASCII is a common subset to all the encodings that I know about, therefore encoding should always work if the text is made exclusively of ASCII characters. Python 3.7 added a new boolean method str.isascii() to check whether your Unicode text is 100% pure ASCII. If it is, you should be able to encode it to bytes in any encoding without raising UnicodeEncodeError.
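A quick sketch of that check, assuming Python 3.7 or later. The list of codecs is an arbitrary sample of 8-bit and multibyte encodings picked for this example.

```python
sample_codecs = ['ascii', 'cp437', 'cp1252', 'koi8_r', 'gb2312', 'utf_8']

plain = 'cafe'
accented = 'café'

# Pure ASCII text encodes to the same bytes in every codec of the sample.
results = {codec: plain.encode(codec) for codec in sample_codecs}

print(plain.isascii(), accented.isascii())
print(set(results.values()))
```

UTF-16 and UTF-32 also encode ASCII text without errors, but they produce different bytes because they use more than one byte per character.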

Coping with UnicodeDecodeError

Not every byte holds a valid ASCII character, and not every byte sequence is valid UTF-8 or UTF-16; therefore, when you assume one of these encodings while converting a binary sequence to text, you will get a UnicodeDecodeError if unexpected bytes are found.

On the other hand, many legacy 8-bit encodings like 'iso8859_1' and 'koi8_r' are able to decode any stream of bytes, including random noise, without reporting errors, and others like 'cp1252' reject only a handful of byte values. Therefore, if your program assumes the wrong 8-bit encoding, it will silently decode garbage.
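The contrast can be demonstrated by decoding every possible byte value at once. This is an illustrative sketch; the name noise is made up.

```python
# All 256 possible byte values, in order: certainly not meaningful text.
noise = bytes(range(256))

# These two 8-bit codecs map every byte to some character, so no error occurs.
as_latin1 = noise.decode('iso8859_1')
as_koi8r = noise.decode('koi8_r')

# UTF-8 has structural rules, so arbitrary bytes are rejected.
try:
    noise.decode('utf_8')
    utf8_accepted = True
except UnicodeDecodeError:
    utf8_accepted = False

print(len(as_latin1), len(as_koi8r), utf8_accepted)
```

Both 8-bit decodings succeed and yield 256 characters each, yet they disagree on what most bytes above 127 mean.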

Tip

Garbled characters are known as gremlins or mojibake (文字化け—Japanese for “transformed text”).

Example 4-6 illustrates how using the wrong codec may produce gremlins or a UnicodeDecodeError.

Example 4-6. Decoding from bytes to str: success and error handling
>>> octets = b'Montr\xe9al'  1
>>> octets.decode('cp1252')  2
'Montréal'
>>> octets.decode('iso8859_7')  3
'Montrιal'
>>> octets.decode('koi8_r')  4
'MontrИal'
>>> octets.decode('utf_8')  5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5:
invalid continuation byte
>>> octets.decode('utf_8', errors='replace')  6
'Montr�al'
1

These bytes are the characters for “Montréal” encoded as latin1; '\xe9' is the byte for “é”.

2

Decoding with 'cp1252' (Windows 1252) works because it is a proper superset of latin1.

3

ISO-8859-7 is intended for Greek, so the '\xe9' byte is misinterpreted, and no error is issued.

4

KOI8-R is for Russian. Now '\xe9' stands for the Cyrillic letter “И”.

5

The 'utf_8' codec detects that octets is not valid UTF-8, and raises UnicodeDecodeError.

6

Using 'replace' error handling, the \xe9 byte is replaced by “�” (code point U+FFFD), the official Unicode REPLACEMENT CHARACTER intended to represent unknown characters.

SyntaxError When Loading Modules with Unexpected Encoding

UTF-8 is the default source encoding for Python 3, just as ASCII was the default for Python 2 (starting with 2.5). If you load a .py module containing non-UTF-8 data and no encoding declaration, you get a message like this:

SyntaxError: Non-UTF-8 code starting with '\xe1' in file ola.py on line
  1, but no encoding declared; see http://python.org/dev/peps/pep-0263/
  for details

Because UTF-8 is widely deployed in GNU/Linux and OSX systems, a likely scenario is opening a .py file created on Windows with cp1252. Note that this error happens even in Python for Windows, because the default encoding for Python 3 source is UTF-8 across all platforms.

To fix this problem, add a magic coding comment at the top of the file, as shown in Example 4-7.

Example 4-7. ola.py: “Hello, World!” in Portuguese
# coding: cp1252

print('Olá, Mundo!')
Tip

Now that Python 3 source code is no longer limited to ASCII and defaults to the excellent UTF-8 encoding, the best “fix” for source code in legacy encodings like 'cp1252' is to convert them to UTF-8 already, and not bother with the coding comments. If your editor does not support UTF-8, it’s time to switch.

Suppose you have a text file, be it source code or poetry, but you don’t know its encoding. How do you detect the actual encoding? The next section answers that with a library recommendation.

How to Discover the Encoding of a Byte Sequence

How do you find the encoding of a byte sequence? Short answer: you can’t. You must be told.

Some communication protocols and file formats, like HTTP and XML, contain headers that explicitly tell us how the content is encoded. You can be sure that some byte streams are not ASCII because they contain byte values over 127, and the way UTF-8 and UTF-16 are built also limits the possible byte sequences. But even then, you can never be 100% positive that a binary file is ASCII or UTF-8 just because certain bit patterns are not there.

However, considering that human languages also have their rules and restrictions, once you assume that a stream of bytes is human plain text it may be possible to sniff out its encoding using heuristics and statistics. For example, if b'\x00' bytes are common, it is probably a 16- or 32-bit encoding, and not an 8-bit scheme, because null characters in plain text are bugs; when the byte sequence b'\x20\x00' appears often, it is likely to be the space character (U+0020) in a UTF-16LE encoding, rather than the obscure U+2000 EN QUAD character, whatever that is.
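The null-byte heuristic described above can be sketched in a few lines. This is a deliberately crude illustration with a hypothetical function name and an arbitrary threshold; it is not how any real detector is implemented, and it only makes sense for text dominated by Latin characters.

```python
def looks_like_utf16le(data: bytes) -> bool:
    """Crude guess: Latin text in UTF-16LE has nulls at odd offsets only."""
    if len(data) < 2:
        return False
    even_nulls = data[0::2].count(0)   # low bytes of each 16-bit unit
    odd_nulls = data[1::2].count(0)    # high bytes of each 16-bit unit
    # Arbitrary threshold: at least a quarter of all bytes are odd-offset nulls.
    return even_nulls == 0 and odd_nulls > len(data) // 4

text = 'El Niño'
guesses = {
    codec: looks_like_utf16le(text.encode(codec))
    for codec in ('utf_16le', 'utf_16be', 'utf_8', 'latin_1')
}
print(guesses)
```

A real detector combines many such signals with statistics about character frequencies in different languages.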

That is how the package Chardet — The Universal Character Encoding Detector works to guess one of more than 30 supported encodings. Chardet is a Python library that you can use in your programs, but also includes a command-line utility, chardetect. Here is what it reports on the source file for this chapter:

$ chardetect 04-text-byte.asciidoc
04-text-byte.asciidoc: utf-8 with confidence 0.99

Although binary sequences of encoded text usually don’t carry explicit hints of their encoding, the UTF formats may prepend a byte order mark to the textual content. That is explained next.

BOM: A Useful Gremlin

In Example 4-4, you may have noticed a couple of extra bytes at the beginning of a UTF-16 encoded sequence. Here they are again:

>>> u16 = 'El Niño'.encode('utf_16')
>>> u16
b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'

The bytes are b'\xff\xfe'. That is a BOM (byte-order mark) denoting the “little-endian” byte ordering of the Intel CPU where the encoding was performed.

On a little-endian machine, for each code point the least significant byte comes first: the letter 'E', code point U+0045 (decimal 69), is encoded in byte offsets 2 and 3 as 69 and 0:

>>> list(u16)
[255, 254, 69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111, 0]

On a big-endian CPU, the encoding would be reversed; 'E' would be encoded as 0 and 69.

To avoid confusion, the UTF-16 encoding prepends the text to be encoded with the special invisible character ZERO WIDTH NO-BREAK SPACE (U+FEFF). On a little-endian system, that is encoded as b'\xff\xfe' (decimal 255, 254). Because, by design, there is no U+FFFE character in Unicode, the byte sequence b'\xff\xfe' must mean the ZERO WIDTH NO-BREAK SPACE on a little-endian encoding, so the codec knows which byte ordering to use.

There is a variant of UTF-16—UTF-16LE—that is explicitly little-endian, and another one explicitly big-endian, UTF-16BE. If you use them, a BOM is not generated:

>>> u16le = 'El Niño'.encode('utf_16le')
>>> list(u16le)
[69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111, 0]
>>> u16be = 'El Niño'.encode('utf_16be')
>>> list(u16be)
[0, 69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111]

If present, the BOM is supposed to be filtered by the UTF-16 codec, so that you only get the actual text contents of the file without the leading ZERO WIDTH NO-BREAK SPACE. The Unicode standard says that if a file is UTF-16 and has no BOM, it should be assumed to be UTF-16BE (big-endian). However, the Intel x86 architecture is little-endian, so there is plenty of little-endian UTF-16 with no BOM in the wild.
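The filtering behavior can be verified with the BOM constants from the codecs module. In this sketch the BOM is prepended by hand so that the result does not depend on the endianness of the machine running it.

```python
import codecs

text = 'El Niño'

# Build both byte orders explicitly, each with its own BOM.
le_with_bom = codecs.BOM_UTF16_LE + text.encode('utf_16le')
be_with_bom = codecs.BOM_UTF16_BE + text.encode('utf_16be')

# The generic 'utf_16' codec reads the BOM, picks the byte order, and drops it.
decoded_le = le_with_bom.decode('utf_16')
decoded_be = be_with_bom.decode('utf_16')

# The explicit 'utf_16le' codec does not filter the BOM: it stays in the text.
kept = le_with_bom.decode('utf_16le')

print(decoded_le, decoded_be, repr(kept))
```

The leftover character in kept is U+FEFF, which is invisible when printed but still counts as one character in the str.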

This whole issue of endianness only affects encodings that use words of more than one byte, like UTF-16 and UTF-32. One big advantage of UTF-8 is that it produces the same byte sequence regardless of machine endianness, so no BOM is needed. Nevertheless, some Windows applications (notably Notepad) add the BOM to UTF-8 files anyway, and Excel depends on the BOM to detect a UTF-8 file, otherwise it assumes the content is encoded with a Windows code page. The character U+FEFF encoded in UTF-8 is the three-byte sequence b'\xef\xbb\xbf'. So if a file starts with those three bytes, it is likely to be a UTF-8 file with a BOM. However, Python does not automatically assume a file is UTF-8 just because it starts with b'\xef\xbb\xbf'.
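Python's standard library includes a codec named 'utf_8_sig' for this UTF-8 with BOM variant. The sketch below shows how it differs from plain 'utf_8'; the variable names are illustrative.

```python
# Encoding with 'utf_8_sig' prepends the three-byte UTF-8 BOM.
with_bom = 'café'.encode('utf_8_sig')

# Plain 'utf_8' keeps the BOM as a U+FEFF character in the decoded text.
plain_decode = with_bom.decode('utf_8')

# 'utf_8_sig' strips a leading BOM when decoding...
sig_decode = with_bom.decode('utf_8_sig')

# ...and also accepts input that has no BOM at all.
no_bom_decode = b'caf\xc3\xa9'.decode('utf_8_sig')

print(with_bom, repr(plain_decode), sig_decode, no_bom_decode)
```

Because it handles both cases when decoding, 'utf_8_sig' is a reasonable choice for reading files that may or may not start with a BOM.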

We now move on to handling text files in Python 3.

Handling Text Files

The best practice for handling text I/O is the “Unicode sandwich” (Figure 4-2).5 This means that bytes should be decoded to str as early as possible on input (e.g., when opening a file for reading). The “filling” of the sandwich is the business logic of your program, where text handling is done exclusively on str objects. You should never be encoding or decoding in the middle of other processing. On output, the str are encoded to bytes as late as possible. Most web frameworks work like that, and we rarely touch bytes when using them. In Django, for example, your views should output Unicode str; Django itself takes care of encoding the response to bytes, using UTF-8 by default.

Unicode sandwich diagram
Figure 4-2. Unicode sandwich: current best practice for text processing

Python 3 makes it easier to follow the advice of the Unicode sandwich, because the open built-in does the necessary decoding when reading and encoding when writing files in text mode, so all you get from my_file.read() and pass to my_file.write(text) are str objects.6

Therefore, using text files is apparently simple. But if you rely on default encodings you will get bitten.

Consider the console session in Example 4-8. Can you spot the bug?

Example 4-8. A platform encoding issue (if you try this on your machine, you may or may not see the problem)
>>> open('cafe.txt', 'w', encoding='utf_8').write('café')
4
>>> open('cafe.txt').read()
'cafÃ©'

The bug: I specified UTF-8 encoding when writing the file but failed to do so when reading it, so Python assumed Windows default file encoding (code page 1252) and the trailing bytes in the file were decoded as characters 'Ã©' instead of 'é'.

I ran Example 4-8 on Python 3.8.1, 64 bits, on Windows 10 (build 18363). The same statements running on recent GNU/Linux or Mac OSX work perfectly well because their default encoding is UTF-8, giving the false impression that everything is fine. If the encoding argument was omitted when opening the file to write, the locale default encoding would be used, and we’d read the file correctly using the same encoding. But then this script would generate files with different byte contents depending on the platform or even depending on locale settings in the same platform, creating compatibility problems.

Tip

Code that has to run on multiple machines or on multiple occasions should never depend on encoding defaults. Always pass an explicit encoding= argument when opening text files, because the default may change from one machine to the next, or from one day to the next.
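A minimal sketch of that advice follows. It writes to a temporary directory so it can run anywhere, and passes encoding= on both the write and the read; the file name is arbitrary.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'cafe.txt'

    # Explicit encoding when writing...
    with open(path, 'w', encoding='utf_8') as fp:
        written = fp.write('café')      # number of characters, not bytes

    # ...and the same explicit encoding when reading.
    with open(path, encoding='utf_8') as fp:
        text = fp.read()

    size = path.stat().st_size          # number of bytes on disk

print(written, repr(text), size)
```

With both calls spelling out the encoding, the result is the same on every platform and locale.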

A curious detail in Example 4-8 is that the write function in the first statement reports that four characters were written, but in the next line five characters are read. Example 4-9 is an extended version of Example 4-8, explaining that and other details.

Example 4-9. Closer inspection of Example 4-8 running on Windows reveals the bug and how to fix it
>>> fp = open('cafe.txt', 'w', encoding='utf_8')
>>> fp  1
<_io.TextIOWrapper name='cafe.txt' mode='w' encoding='utf_8'>
>>> fp.write('café')  2
4
>>> fp.close()
>>> import os
>>> os.stat('cafe.txt').st_size  3
5
>>> fp2 = open('cafe.txt')
>>> fp2  4
<_io.TextIOWrapper name='cafe.txt' mode='r' encoding='cp1252'>
>>> fp2.encoding  5
'cp1252'
>>> fp2.read() 6
'cafÃ©'
>>> fp3 = open('cafe.txt', encoding='utf_8')  7
>>> fp3
<_io.TextIOWrapper name='cafe.txt' mode='r' encoding='utf_8'>
>>> fp3.read() 8
'café'
>>> fp4 = open('cafe.txt', 'rb')  9
>>> fp4                           10
<_io.BufferedReader name='cafe.txt'>
>>> fp4.read()  11
b'caf\xc3\xa9'
1

By default, open uses text mode and returns a TextIOWrapper object with a specific encoding.

2

The write method on a TextIOWrapper returns the number of Unicode characters written.

3

os.stat says the file has 5 bytes; UTF-8 encodes 'é' as 2 bytes, 0xc3 and 0xa9.

4

Opening a text file with no explicit encoding returns a TextIOWrapper with the encoding set to a default from the locale.

5

A TextIOWrapper object has an encoding attribute that you can inspect: cp1252 in this case.

6

In the Windows cp1252 encoding, the byte 0xc3 is an “Ô (A with tilde) and 0xa9 is the copyright sign.

7

Opening the same file with the correct encoding.

8

The expected result: the same four Unicode characters for 'café'.

9

The 'rb' flag opens a file for reading in binary mode.

10

The returned object is a BufferedReader and not a TextIOWrapper.

11

Reading that returns bytes, as expected.

Tip

Do not open text files in binary mode unless you need to analyze the file contents to determine the encoding—even then, you should be using Chardet instead of reinventing the wheel (see “How to Discover the Encoding of a Byte Sequence”). Ordinary code should only use binary mode to open binary files, like raster images.

The problem in Example 4-9 has to do with relying on a default setting while opening a text file. There are several sources for such defaults, as the next section shows.

Beware of Encoding Defaults

Several settings affect the encoding defaults for I/O in Python. See the default_encodings.py script in Example 4-10.

Example 4-10. Exploring encoding defaults
import sys, locale

expressions = """
        locale.getpreferredencoding()
        type(my_file)
        my_file.encoding
        sys.stdout.isatty()
        sys.stdout.encoding
        sys.stdin.isatty()
        sys.stdin.encoding
        sys.stderr.isatty()
        sys.stderr.encoding
        sys.getdefaultencoding()
        sys.getfilesystemencoding()
    """

my_file = open('dummy', 'w')

for expression in expressions.split():
    value = eval(expression)
    print(expression.rjust(30), '->', repr(value))

The output of Example 4-10 on GNU/Linux (Ubuntu 14.04 to 19.10) and MacOS (10.9 to 10.14) is identical, showing that UTF-8 is used everywhere in these systems:

$ python3 default_encodings.py
 locale.getpreferredencoding() -> 'UTF-8'
                 type(my_file) -> <class '_io.TextIOWrapper'>
              my_file.encoding -> 'UTF-8'
           sys.stdout.isatty() -> True
           sys.stdout.encoding -> 'utf-8'
            sys.stdin.isatty() -> True
            sys.stdin.encoding -> 'utf-8'
           sys.stderr.isatty() -> True
           sys.stderr.encoding -> 'utf-8'
      sys.getdefaultencoding() -> 'utf-8'
   sys.getfilesystemencoding() -> 'utf-8'

On Windows, however, the output is shown in Example 4-11.

Example 4-11. Default encodings on Windows 10 PowerShell (output is the same on cmd.exe)
> chcp  1
Active code page: 437
> python default_encodings.py  2
 locale.getpreferredencoding() -> 'cp1252'  3
                 type(my_file) -> <class '_io.TextIOWrapper'>
              my_file.encoding -> 'cp1252'  4
           sys.stdout.isatty() -> True      5
           sys.stdout.encoding -> 'utf-8'   6
            sys.stdin.isatty() -> True
            sys.stdin.encoding -> 'utf-8'
           sys.stderr.isatty() -> True
           sys.stderr.encoding -> 'utf-8'
      sys.getdefaultencoding() -> 'utf-8'
   sys.getfilesystemencoding() -> 'utf-8'
1

chcp shows the active code page for the console: 437.

2

Running default_encodings.py with output to console.

3

locale.getpreferredencoding() is the most important setting.

4

Text files use locale.getpreferredencoding() by default.

5

The output is going to the console, so sys.stdout.isatty() is True.

6

Now, sys.stdout.encoding is not the same as the console code page reported by chcp!

Unicode support in Windows itself, and in Python for Windows, has improved since I wrote about this in the 1st edition of Fluent Python. Example 4-11 used to report four different encodings in Python 3.4 on Windows 7. The encodings for stdout, stdin, and stderr used to be the same as the active code page reported by the chcp command, but now they’re all utf-8 thanks to PEP 528: Change Windows console encoding to UTF-8, implemented in Python 3.6, and to Unicode support in PowerShell and cmd.exe (since Windows 1809 from October, 2018).7 It’s weird that chcp and sys.stdout.encoding say different things when stdout is writing to the console, but it’s great that now we can print Unicode strings without encoding errors on Windows, unless the user redirects output to a file, as we’ll soon see. That does not mean all your favorite emojis will appear in the console: that also depends on the font the console is using.

Another change was PEP 529: Change Windows filesystem encoding to UTF-8, also implemented in Python 3.6, which changed the file system encoding (used to represent names of directories and files) from Microsoft’s proprietary MBCS to UTF-8.

However, if the output of Example 4-10 is redirected to a file, like this:

Z:\>python default_encodings.py > encodings.log

Then, the value of sys.stdout.isatty() becomes False, and sys.stdout.encoding is set by locale.getpreferredencoding(), 'cp1252' on that machine, but sys.stdin.encoding and sys.stderr.encoding remain utf-8.

This means that a script like Example 4-12 works when printing to the console, but may break when output is redirected to a file.

Example 4-12. stdout_check.py
import sys
from unicodedata import name

print(sys.version)
print()
print('sys.stdout.isatty():', sys.stdout.isatty())
print('sys.stdout.encoding:', sys.stdout.encoding)
print()

test_chars = [
    '\u2026',  # HORIZONTAL ELLIPSIS (in cp1252)
    '\u221E',  # INFINITY (in cp437)
    '\u32B7',  # CIRCLED NUMBER FORTY TWO
]

for char in test_chars:
    print(f'Trying to output {name(char)}:')
    print(char)

Example 4-12 displays the result of sys.stdout.isatty(), the value of sys.stdout.encoding, and these three characters:

  • '…' HORIZONTAL ELLIPSIS (U+2026)--exists in CP 1252 but not in CP 437

  • '∞' INFINITY (U+221E)--exists in CP 437 but not in CP 1252

  • '㊷' CIRCLED NUMBER FORTY TWO (U+32B7)--doesn’t exist in CP 1252 or CP 437
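The claims in the list above can be checked without involving the console at all. The following is a minimal sketch: the helper name can_encode is hypothetical and is not part of stdout_check.py. It tries to encode each test character with a given codec and reports whether that succeeds.

```python
import unicodedata


def can_encode(char, encoding):
    """Return True if `char` can be encoded with the codec `encoding`."""
    try:
        char.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


test_chars = ['\u2026', '\u221e', '\u32b7']
for char in test_chars:
    # Map each codec name to True/False for this character.
    support = {enc: can_encode(char, enc)
               for enc in ('cp1252', 'cp437', 'utf-8')}
    print(f'U+{ord(char):04X} {unicodedata.name(char)}: {support}')
```

A check like this is useful only as a diagnostic; it does not change what the console or a redirected file will actually receive.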

When I run stdout_check.py on PowerShell or cmd.exe, it works as captured in Figure 4-3.

Screen capture of `stdout_check.py` on PowerShell
Figure 4-3. Running stdout_check.py on PowerShell.

Despite chcp reporting the active code page as 437, sys.stdout.encoding is UTF-8, so the HORIZONTAL ELLIPSIS and INFINITY both output correctly. The CIRCLED NUMBER FORTY TWO is replaced by a rectangle, but no error is raised. Presumably it is recognized as a valid character, but the console font doesn’t have the glyph to display it.

However, when I redirect the output of stdout_check.py to a file, I get Figure 4-4.

Screen capture of `stdout_check.py` on PowerShell, redirecting output
Figure 4-4. Running stdout_check.py on PowerShell, redirecting output.

The first problem demonstrated by Figure 4-4 is the UnicodeEncodeError mentioning character '\u221e', because sys.stdout.encoding is 'cp1252'--a code page that doesn’t have the INFINITY character.

Then, inspecting the partially-written out.txt, I get two surprises:

  1. Reading out.txt with the type command—or a Windows editor like VS Code or Sublime Text—shows that instead of HORIZONTAL ELLIPSIS, I got 'à' (LATIN SMALL LETTER A WITH GRAVE). As it turns out, the byte value 0x85 in CP 1252 means '…', but in CP 437 the same byte value represents 'à'. So it seems the active code page does matter, not in a sensible or useful way, but as partial explanation of a bad Unicode experience.

  2. out.txt was written with the UTF-16 LE encoding. This would be good, as UTF encodings support all Unicode characters—if it wasn’t for the unfortunate replacement of '…' with 'à'.
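The first surprise comes down to one byte being read with two different code pages. As a small sketch, decoding the single byte 0x85 with each codec shows the mismatch directly:

```python
raw = b'\x85'

as_cp1252 = raw.decode('cp1252')  # how Python encoded the ellipsis
as_cp437 = raw.decode('cp437')    # how the active console code page reads it

print(repr(as_cp1252), repr(as_cp437))
```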

Note

I used a laptop configured for the US market, running Windows 10 OEM to run these experiments. Windows versions localized for other countries may have different encoding configurations. For example, in Brazil the Windows console uses code page 850 by default—not 437.

To wrap up this maddening issue of default encodings, let’s take a final look at the different encodings in Example 4-11:

  • If you omit the encoding argument when opening a file, the default is given by locale.getpreferredencoding() ('cp1252' in Example 4-11).

  • The encoding of sys.stdout|stdin|stderr used to be set by the PYTHONIOENCODING environment variable before Python 3.6—now that variable is ignored, unless PYTHONLEGACYWINDOWSSTDIO is set to a non-empty string. Otherwise, the encoding for standard I/O is UTF-8 for interactive I/O, or defined by locale.getpreferredencoding() if the output/input is redirected to/from a file.

  • sys.getdefaultencoding() is used internally by Python in implicit conversions of binary data to/from str; this happens less often in Python 3, but still happens.8 Changing this setting is not supported.9

  • sys.getfilesystemencoding() is used to encode/decode filenames (not file contents). It is used when open() gets a str argument for the filename; if the filename is given as a bytes argument, it is passed unchanged to the OS API. Before Python 3.6, this was MBCS on Windows, now it’s UTF-8. (On this topic, a useful answer on StackOverflow is “Difference between MBCS and UTF-8 on Windows”.)

Note

On GNU/Linux and OSX all of these encodings are set to UTF-8 by default, and have been for several years, so I/O handles all Unicode characters. On Windows, not only are different encodings used in the same system, but they are usually code pages like 'cp850' or 'cp1252' that support only ASCII with 127 additional characters that are not the same from one encoding to the other. Therefore, Windows users are far more likely to face encoding errors unless they are extra careful.

To summarize, the most important encoding setting is that returned by locale.getpreferredencoding(): it is the default for opening text files and for sys.stdout/stdin/stderr when they are redirected to files. However, the documentation reads (in part):

locale.getpreferredencoding(do_setlocale=True)

Return the encoding used for text data, according to user preferences. User preferences are expressed differently on different systems, and might not be available programmatically on some systems, so this function only returns a guess. […]

Therefore, the best advice about encoding defaults is: do not rely on them.
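In practice, not relying on defaults means passing the encoding argument every time a text file is opened. The sketch below assumes UTF-8 is the encoding you want, and the file name demo.txt is made up for illustration. It writes and reads a string containing characters that cp1252 cannot represent, so the round trip would fail or corrupt data if a legacy default were used instead.

```python
import tempfile
from pathlib import Path

text = 'caf\u00e9 \u221e'  # INFINITY is not available in cp1252

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'demo.txt'  # hypothetical file name
    # Explicit encoding on write...
    with open(path, 'w', encoding='utf-8') as fp:
        fp.write(text)
    # ...and the same explicit encoding on read.
    with open(path, encoding='utf-8') as fp:
        round_trip = fp.read()
    raw = path.read_bytes()

print(round_trip == text)
print(raw)
```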

You will avoid a lot of pain if you follow the advice of the Unicode sandwich and always are explicit about the encodings in your programs. Unfortunately, Unicode is painful even if you get your bytes correctly converted to str. The next two sections cover subjects that are simple in ASCII-land, but get quite complex on planet Unicode: text normalization (i.e., converting text to a uniform representation for comparisons) and sorting.

Normalizing Unicode for Reliable Comparisons

String comparisons are complicated by the fact that Unicode has combining characters: diacritics and other marks that attach to the preceding character, appearing as one when printed.

For example, the word “café” may be composed in two ways, using four or five code points, but the result looks exactly the same:

>>> s1 = 'café'
>>> s2 = 'cafe\u0301'
>>> s1, s2
('café', 'café')
>>> len(s1), len(s2)
(4, 5)
>>> s1 == s2
False

The code point U+0301 is the COMBINING ACUTE ACCENT. Using it after “e” renders “é”. In the Unicode standard, sequences like 'é' and 'e\u0301' are called “canonical equivalents,” and applications are supposed to treat them as the same. But Python sees two different sequences of code points, and considers them not equal.

The solution is to use Unicode normalization, provided by the unicodedata.normalize function. The first argument to that function is one of four strings: 'NFC', 'NFD', 'NFKC', and 'NFKD'. Let’s start with the first two.

Normalization Form C (NFC) composes the code points to produce the shortest equivalent string, while NFD decomposes, expanding composed characters into base characters and separate combining characters. Both of these normalizations make comparisons work as expected:

>>> from unicodedata import normalize
>>> s1 = 'café'  # composed "e" with acute accent
>>> s2 = 'cafe\u0301'  # decomposed "e" and acute accent
>>> len(s1), len(s2)
(4, 5)
>>> len(normalize('NFC', s1)), len(normalize('NFC', s2))
(4, 4)
>>> len(normalize('NFD', s1)), len(normalize('NFD', s2))
(5, 5)
>>> normalize('NFC', s1) == normalize('NFC', s2)
True
>>> normalize('NFD', s1) == normalize('NFD', s2)
True

Western keyboards usually generate composed characters, so text typed by users will be in NFC by default. However, to be safe, it may be good to normalize strings with normalize('NFC', user_text) before saving. NFC is also the normalization form recommended by the W3C in Character Model for the World Wide Web: String Matching and Searching.
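One way to apply that advice is a small helper that runs on user input before it is stored. This is only a sketch: the name to_nfc is hypothetical, and it assumes Python 3.8 or later, where unicodedata.is_normalized is available to skip strings that are already in NFC.

```python
from unicodedata import is_normalized, normalize


def to_nfc(user_text):
    """Return user_text in NFC, unchanged if it is already normalized."""
    if is_normalized('NFC', user_text):
        return user_text
    return normalize('NFC', user_text)


decomposed = 'cafe\u0301'  # "e" followed by COMBINING ACUTE ACCENT
stored = to_nfc(decomposed)
print(len(decomposed), len(stored))
```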

Some single characters are normalized by NFC into another single character. The symbol for the ohm (Ω) unit of electrical resistance is normalized to the Greek uppercase omega. They are visually identical, but they compare unequal so it is essential to normalize to avoid surprises:

>>> from unicodedata import normalize, name
>>> ohm = '\u2126'
>>> name(ohm)
'OHM SIGN'
>>> ohm_c = normalize('NFC', ohm)
>>> name(ohm_c)
'GREEK CAPITAL LETTER OMEGA'
>>> ohm == ohm_c
False
>>> normalize('NFC', ohm) == normalize('NFC', ohm_c)
True

In the acronyms for the other two normalization forms—NFKC and NFKD—the letter K stands for “compatibility.” These are stronger forms of normalization, affecting the so-called “compatibility characters.” Although one goal of Unicode is to have a single “canonical” code point for each character, some characters appear more than once for compatibility with preexisting standards. For example, the micro sign, 'µ' (U+00B5), was added to Unicode to support round-trip conversion to latin1, even though the same character is part of the Greek alphabet with code point U+03BC (GREEK SMALL LETTER MU). So, the micro sign is considered a “compatibility character.”

In the NFKC and NFKD forms, each compatibility character is replaced by a “compatibility decomposition” of one or more characters that are considered a “preferred” representation, even if there is some formatting loss; ideally, the formatting should be the responsibility of external markup, not part of Unicode. To exemplify, the compatibility decomposition of the one half fraction '½' (U+00BD) is the sequence of three characters '1⁄2' (digit one, FRACTION SLASH U+2044, digit two), and the compatibility decomposition of the micro sign 'µ' (U+00B5) is the lowercase mu 'μ' (U+03BC).10
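The decomposition mappings themselves can be inspected with unicodedata.decomposition, which returns the mapping recorded in the Unicode database as a string of hexadecimal code points. In the sketch below, a tag in angle brackets marks a compatibility mapping, while a canonical mapping such as the one for 'é' has no tag.

```python
import unicodedata

for char in ['\u00e9', '\u00bd', '\u00b5', '\u00b2']:
    mapping = unicodedata.decomposition(char)
    print(f'U+{ord(char):04X} {unicodedata.name(char)}: {mapping!r}')
```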

Here is how the NFKC works in practice:

>>> from unicodedata import normalize, name
>>> half = '½'
>>> normalize('NFKC', half)
'1⁄2'
>>> four_squared = '4²'
>>> normalize('NFKC', four_squared)
'42'
>>> micro = 'µ'
>>> micro_kc = normalize('NFKC', micro)
>>> micro, micro_kc
('µ', 'μ')
>>> ord(micro), ord(micro_kc)
(181, 956)
>>> name(micro), name(micro_kc)
('MICRO SIGN', 'GREEK SMALL LETTER MU')

Although '1⁄2' is a reasonable substitute for '½', and the micro sign is really a lowercase Greek mu, converting '4²' to '42' changes the meaning. An application could store '4²' as '4<sup>2</sup>', but the normalize function knows nothing about formatting. Therefore, NFKC or NFKD may lose or distort information, but they can produce convenient intermediate representations for searching and indexing: users may be pleased that a search for '1⁄2 inch' also finds documents containing '½ inch'.

Warning

NFKC and NFKD normalization should be applied with care and only in special cases—e.g., search and indexing—and not for permanent storage, because these transformations cause data loss.
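A common way to respect this warning is to keep the original text untouched and derive a separate, lossy key only for lookups. The following is a minimal sketch under that assumption; search_key and the tiny in-memory index are hypothetical, not part of any library.

```python
from unicodedata import normalize


def search_key(text):
    """Build a lossy key for indexing; never store this in place of the text."""
    return normalize('NFKC', text).casefold()


documents = ['\u00bd inch bolt', '4\u00b2 = 16', '10 \u00b5F capacitor']
index = {search_key(doc): doc for doc in documents}  # key -> original text

query = '1\u20442 INCH BOLT'  # typed with FRACTION SLASH (U+2044)
print(index.get(search_key(query)))
```

Note that a query typed with the ASCII slash '/' would still not match, because NFKC maps '½' to a sequence using U+2044, not U+002F. A real search feature would need an extra rule for that.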

When preparing text for searching or indexing, another operation is useful: case folding, our next subject.

Case Folding

Case folding is essentially converting all text to lowercase, with some additional transformations. It is supported by the str.casefold() method since Python 3.3.

For any string s containing only latin1 characters, s.casefold() produces the same result as s.lower(), with only two exceptions—the micro sign 'µ' is changed to the Greek lowercase mu (which looks the same in most fonts) and the German Eszett or “sharp s” (ß) becomes “ss”:

>>> micro = 'µ'
>>> name(micro)
'MICRO SIGN'
>>> micro_cf = micro.casefold()
>>> name(micro_cf)
'GREEK SMALL LETTER MU'
>>> micro, micro_cf
('µ', 'μ')
>>> eszett = 'ß'
>>> name(eszett)
'LATIN SMALL LETTER SHARP S'
>>> eszett_cf = eszett.casefold()
>>> eszett, eszett_cf
('ß', 'ss')

There are nearly 300 code points for which str.casefold() and str.lower() return different results.
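That figure can be checked against the Unicode version bundled with your interpreter by brute force. The sketch below scans every code point and collects those where the two methods disagree; the exact count depends on the Python version, so no particular number is assumed here.

```python
import sys

# Collect every character whose casefold() differs from its lower().
differing = [chr(code) for code in range(sys.maxunicode + 1)
             if chr(code).casefold() != chr(code).lower()]

print(len(differing), 'code points differ')
print([f'U+{ord(c):04X}' for c in differing[:5]])
```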

As usual with anything related to Unicode, case folding is a complicated issue with plenty of linguistic special cases, but the Python core team made an effort to provide a solution that hopefully works for most users.

In the next couple of sections, we’ll put our normalization knowledge to use developing utility functions.

Utility Functions for Normalized Text Matching

As we’ve seen, NFC and NFD are safe to use and allow sensible comparisons between Unicode strings. NFC is the best normalized form for most applications. str.casefold() is the way to go for case-insensitive comparisons.

If you work with text in many languages, a pair of functions like nfc_equal and fold_equal in Example 4-13 are useful additions to your toolbox.

Example 4-13. normeq.py: normalized Unicode string comparison
"""
Utility functions for normalized Unicode string comparison.

Using Normal Form C, case sensitive:

    >>> s1 = 'café'
    >>> s2 = 'cafe\u0301'
    >>> s1 == s2
    False
    >>> nfc_equal(s1, s2)
    True
    >>> nfc_equal('A', 'a')
    False

Using Normal Form C with case folding:

    >>> s3 = 'Straße'
    >>> s4 = 'strasse'
    >>> s3 == s4
    False
    >>> nfc_equal(s3, s4)
    False
    >>> fold_equal(s3, s4)
    True
    >>> fold_equal(s1, s2)
    True
    >>> fold_equal('A', 'a')
    True

"""

from unicodedata import normalize

def nfc_equal(str1, str2):
    return normalize('NFC', str1) == normalize('NFC', str2)

def fold_equal(str1, str2):
    return (normalize('NFC', str1).casefold() ==
            normalize('NFC', str2).casefold())

Beyond Unicode normalization and case folding—which are both part of the Unicode standard—sometimes it makes sense to apply deeper transformations, like changing 'café' into 'cafe'. We’ll see when and how in the next section.

Extreme “Normalization”: Taking Out Diacritics

The Google Search secret sauce involves many tricks, but one of them apparently is ignoring diacritics (e.g., accents, cedillas, etc.), at least in some contexts. Removing diacritics is not a proper form of normalization because it often changes the meaning of words and may produce false positives when searching. But it helps in coping with some facts of life: people sometimes are lazy or ignorant about the correct use of diacritics, and spelling rules change over time, meaning that accents come and go in living languages.

Outside of searching, getting rid of diacritics also makes for more readable URLs, at least in Latin-based languages. Take a look at the URL for the Wikipedia article about the city of São Paulo:

http://en.wikipedia.org/wiki/S%C3%A3o_Paulo

The %C3%A3 part is the URL-escaped, UTF-8 rendering of the single letter “ã” (“a” with tilde). The following is much friendlier, even if it is not the right spelling:

http://en.wikipedia.org/wiki/Sao_Paulo
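The escaped form can be reproduced with urllib.parse.quote, which percent-encodes the UTF-8 bytes of non-ASCII characters by default. A short sketch:

```python
from urllib.parse import quote, unquote

title = 'S\u00e3o_Paulo'
escaped = quote(title)   # UTF-8 is the default encoding for quote()

print(escaped)
print(unquote(escaped) == title)
```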

To remove all diacritics from a str, you can use a function like Example 4-14.

Example 4-14. Function to remove all combining marks (module simplify.py).
import unicodedata
import string


def shave_marks(txt):
    """Remove all diacritic marks"""
    norm_txt = unicodedata.normalize('NFD', txt)  1
    shaved = ''.join(c for c in norm_txt
                     if not unicodedata.combining(c))  2
    return unicodedata.normalize('NFC', shaved)  3
1

Decompose all characters into base characters and combining marks.

2

Filter out all combining marks.

3

Recompose all characters.

Example 4-15 shows a couple of uses of shave_marks.

Example 4-15. Two examples using shave_marks from Example 4-14
>>> order = '“Herr Voß: • ½ cup of Œtker™ caffè latte • bowl of açaí.”'
>>> shave_marks(order)
'“Herr Voß: • ½ cup of Œtker™ caffe latte • bowl of acai.”'  1
>>> Greek = 'Ζέφυρος, Zéfiro'
>>> shave_marks(Greek)
'Ζεφυρος, Zefiro'  2
1

Only the letters “è”, “ç”, and “í” were replaced.

2

Both “έ” and “é” were replaced.

The function shave_marks from Example 4-14 works all right, but maybe it goes too far. Often the reason to remove diacritics is to change Latin text to pure ASCII, but shave_marks also changes non-Latin characters—like Greek letters—which will never become ASCII just by losing their accents. So it makes sense to analyze each base character and to remove attached marks only if the base character is a letter from the Latin alphabet. This is what Example 4-16 does.

Example 4-16. Function to remove combining marks from Latin characters (import statements are omitted as this is part of the simplify.py module from Example 4-14)
def shave_marks_latin(txt):
    """Remove all diacritic marks from Latin base characters"""
    norm_txt = unicodedata.normalize('NFD', txt)  1
    latin_base = False
    preserve = []
    for c in norm_txt:
        if unicodedata.combining(c) and latin_base:   2
            continue  # ignore diacritic on Latin base char
        preserve.append(c)                            3
        # if it isn't combining char, it's a new base char
        if not unicodedata.combining(c):              4
            latin_base = c in string.ascii_letters
    shaved = ''.join(preserve)
    return unicodedata.normalize('NFC', shaved)   5
1

Decompose all characters into base characters and combining marks.

2

Skip over combining marks when base character is Latin.

3

Otherwise, keep current character.

4

Detect new base character and determine if it’s Latin.

5

Recompose all characters.
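To see the difference from shave_marks, the sketch below repeats the function from Example 4-16 (so that the snippet is self-contained) and applies it to the mixed Greek and Latin string used earlier. Only the accent on the Latin “e” should be removed; the Greek “έ” keeps its mark.

```python
import string
import unicodedata


def shave_marks_latin(txt):
    """Remove diacritic marks only when the base character is an ASCII letter."""
    norm_txt = unicodedata.normalize('NFD', txt)
    latin_base = False
    preserve = []
    for c in norm_txt:
        if unicodedata.combining(c) and latin_base:
            continue  # drop mark attached to a Latin base character
        preserve.append(c)
        if not unicodedata.combining(c):
            latin_base = c in string.ascii_letters
    return unicodedata.normalize('NFC', ''.join(preserve))


# 'Ζέφυρος, Zéfiro' written with escapes to avoid source encoding issues
greek = '\u0396\u03ad\u03c6\u03c5\u03c1\u03bf\u03c2, Z\u00e9firo'
print(shave_marks_latin(greek))
```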

An even more radical step would be to replace common symbols in Western texts (e.g., curly quotes, em dashes, bullets, etc.) with ASCII equivalents. This is what the function asciize does in Example 4-17.

Example 4-17. Transform some Western typographical symbols into ASCII (this snippet is also part of simplify.py from Example 4-14)
single_map = str.maketrans("""‚ƒ„ˆ‹‘’“”•–—˜›""",  1
                           """'f"^<''""---~>""")

multi_map = str.maketrans({  2
    '€': 'EUR',
    '…': '...',
    'Æ': 'AE',
    'æ': 'ae',
    'Œ': 'OE',
    'œ': 'oe',
    '™': '(TM)',
    '‰': '<per mille>',
    '†': '**',
    '‡': '***',
})

multi_map.update(single_map)  3


def dewinize(txt):
    """Replace Win1252 symbols with ASCII chars or sequences"""
    return txt.translate(multi_map)  4


def asciize(txt):
    no_marks = shave_marks_latin(dewinize(txt))     5
    no_marks = no_marks.replace('ß', 'ss')          6
    return unicodedata.normalize('NFKC', no_marks)  7
1

Build mapping table for char-to-char replacement.

2

Build mapping table for char-to-string replacement.

3

Merge mapping tables.

4

dewinize does not affect ASCII or latin1 text, only the Microsoft additions to latin1 in cp1252.

5

Apply dewinize and remove diacritical marks.

6

Replace the Eszett with “ss” (we are not using case fold here because we want to preserve the case).

7

Apply NFKC normalization to compose characters with their compatibility code points.

Example 4-18 shows asciize in use.

Example 4-18. Two examples using asciize from Example 4-17
>>> order = '“Herr Voß: • ½ cup of Œtker™ caffè latte • bowl of açaí.”'
>>> dewinize(order)
'"Herr Voß: - ½ cup of OEtker(TM) caffè latte - bowl of açaí."'  1
>>> asciize(order)
'"Herr Voss: - 1⁄2 cup of OEtker(TM) caffe latte - bowl of acai."'  2
1

dewinize replaces curly quotes, bullets, and ™ (trademark symbol).

2

asciize applies dewinize, drops diacritics, and replaces the 'ß'.

Warning

Different languages have their own rules for removing diacritics. For example, Germans change the 'ü' into 'ue'. Our asciize function is not as refined, so it may or may not be suitable for your language. It works acceptably for Portuguese, though.
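As an illustration of a language-specific rule, the sketch below spells out German umlauts before any generic diacritic removal would run. The helper asciize_german and its mapping table are hypothetical and deliberately incomplete; the sketch also assumes the input uses precomposed characters (NFC), because str.translate works on single code points.

```python
german_map = str.maketrans({
    '\u00e4': 'ae', '\u00f6': 'oe', '\u00fc': 'ue',   # ä ö ü
    '\u00c4': 'Ae', '\u00d6': 'Oe', '\u00dc': 'Ue',   # Ä Ö Ü
    '\u00df': 'ss',                                   # ß
})


def asciize_german(txt):
    """Spell out umlauts and Eszett following the usual German convention."""
    return txt.translate(german_map)


print(asciize_german('M\u00fcller, Stra\u00dfe, \u00d6l'))
```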

To summarize, the functions in simplify.py go way beyond standard normalization and perform deep surgery on the text, with a good chance of changing its meaning. Only you can decide whether to go so far, knowing the target language, your users, and how the transformed text will be used.

This wraps up our discussion of normalizing Unicode text.

The next Unicode matter to sort out is… sorting.

Sorting Unicode Text

Python sorts sequences of any type by comparing the items in each sequence one by one. For strings, this means comparing the code points. Unfortunately, this produces unacceptable results for anyone who uses non-ASCII characters.

Consider sorting a list of fruits grown in Brazil:

>>> fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
>>> sorted(fruits)
['acerola', 'atemoia', 'açaí', 'caju', 'cajá']

Sorting rules vary for different locales, but in Portuguese and many languages that use the Latin alphabet, accents and cedillas rarely make a difference when sorting.11 So “cajá” is sorted as “caja,” and must come before “caju.”

The sorted fruits list should be:

['açaí', 'acerola', 'atemoia', 'cajá', 'caju']

The standard way to sort non-ASCII text in Python is to use the locale.strxfrm function which, according to the locale module docs, “transforms a string to one that can be used in locale-aware comparisons.”

To enable locale.strxfrm, you must first set a suitable locale for your application, and pray that the OS supports it. The sequence of commands in Example 4-19 may work for you.

Example 4-19. locale_sort.py: using the locale.strxfrm function as sort key
include::code/04-text-byte/locale_sort.py

Running Example 4-19 on GNU/Linux (Ubuntu 19.10) with the pt_BR.UTF-8 locale installed, I get this result:

'pt_BR.UTF-8'
['açaí', 'acerola', 'atemoia', 'cajá', 'caju']

So you need to call setlocale(LC_COLLATE, «your_locale») before using locale.strxfrm as the key when sorting.

There are some caveats, though:

  • Because locale settings are global, calling setlocale in a library is not recommended. Your application or framework should set the locale when the process starts, and should not change it afterwards.

  • The locale must be installed on the OS, otherwise setlocale raises a locale.Error: unsupported locale setting exception.

  • You must know how to spell the locale name. They are pretty much standardized in Unix derivatives as 'language_code.encoding', but on Windows the syntax is more complicated: Language Name-Language Variant_Region Name.codepage. Note that the Language Name, Language Variant, and Region Name parts can have spaces inside them, but the parts after the first are prefixed with special different characters: a hyphen, an underline character, and a dot. All parts seem to be optional except the language name. For example: English_United States.850 means Language Name “English”, region “United States”, and code page “850”. The language and region names Windows understands are listed in the MSDN article Language Identifier Constants and Strings, while Code Page Identifiers lists the numbers for the last part.12

  • The locale must be correctly implemented by the makers of the OS. I was successful on Ubuntu 19.10, but not on MacOS 10.14. On MacOS, the call setlocale(LC_COLLATE, 'pt_BR.UTF-8') returns the string 'pt_BR.UTF-8' with no complaints. But sorted(fruits, key=locale.strxfrm) produced the same incorrect result as sorted(fruits) did. I also tried the fr_FR, es_ES, and de_DE locales on OSX, but locale.strxfrm never did its job.13
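Some of these caveats can be handled defensively in code. The following is a minimal sketch, not the book's locale_sort.py: the helper locale_sorted is hypothetical, it falls back to plain code point order when the locale is not installed, and it restores the previous collation setting afterwards. Because the locale is process-wide state, this save-and-restore pattern is suitable for a demo or a single-threaded script, not for library code.

```python
import locale


def locale_sorted(items, locale_name):
    """Sort items with locale.strxfrm, or fall back to code point order."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return sorted(items)  # locale not installed on this OS
    try:
        return sorted(items, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore global state


fruits = ['caju', 'atemoia', 'caj\u00e1', 'a\u00e7a\u00ed', 'acerola']
print(locale_sorted(fruits, 'pt_BR.UTF-8'))
```

Even when setlocale succeeds, the result is only as good as the OS implementation of strxfrm, so the output may still be in code point order on some platforms.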

So the standard library solution to internationalized sorting works, but seems to be well supported only on GNU/Linux (perhaps also on Windows, if you are an expert). Even then, it depends on locale settings, creating deployment headaches.

Fortunately, there is a simpler solution: the PyUCA library, available on PyPI.

Sorting with the Unicode Collation Algorithm

James Tauber, prolific Django contributor, must have felt the pain and created PyUCA, a pure-Python implementation of the Unicode Collation Algorithm (UCA). Example 4-20 shows how easy it is to use.

Example 4-20. Using the pyuca.Collator.sort_key method
>>> import pyuca
>>> coll = pyuca.Collator()
>>> fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
>>> sorted_fruits = sorted(fruits, key=coll.sort_key)
>>> sorted_fruits
['açaí', 'acerola', 'atemoia', 'cajá', 'caju']

This is friendly and just works. I tested it on GNU/Linux, OSX, and Windows. Only Python 3.X is supported at this time.

PyUCA does not take the locale into account. If you need to customize the sorting, you can provide the path to a custom collation table to the Collator() constructor. Out of the box, it uses allkeys.txt, which is bundled with the project. That’s just a copy of the Default Unicode Collation Element Table from Unicode.org.

By the way, that table is one of the many that comprise the Unicode database, our next subject.

The Unicode Database

The Unicode standard provides an entire database—in the form of several structured text files—that includes not only the table mapping code points to character names, but also metadata about the individual characters and how they are related. For example, the Unicode database records whether a character is printable, is a letter, is a decimal digit, or is some other numeric symbol. That’s how the str methods isidentifier, isprintable, isdecimal, and isnumeric work. str.casefold also uses information from a Unicode table.
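The general category stored in the database is available through unicodedata.category, and it lines up with what the str predicate methods report. A small sketch with a handful of sample characters chosen for illustration:

```python
import unicodedata

# Latin letters, ASCII digit, vulgar fraction, Devanagari digit,
# space, and a combining mark.
for char in ['A', 'a', '1', '\u00bd', '\u0969', ' ', '\u0301']:
    print(f'U+{ord(char):04X}',
          unicodedata.category(char),
          'alpha' if char.isalpha() else '-',
          'decimal' if char.isdecimal() else '-',
          'numeric' if char.isnumeric() else '-',
          'printable' if char.isprintable() else '-',
          sep='\t')
```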

Finding characters by name

The unicodedata module has functions to retrieve character metadata, including unicodedata.name(), which returns a character’s official name in the standard. Figure 4-5 demonstrates that function.14

Exploring unicodedata.name in the Python console
Figure 4-5. Exploring unicodedata.name() in the Python console

You can use the name() function to build apps that let users search for characters by name. Figure 4-6 demonstrates the cf.py command-line script that takes one or more words as arguments, and lists the characters that have those words in their official Unicode names. The full source code for cf.py is in Example 4-21.

Using cf.py to find smiling cats.
Figure 4-6. Using cf.py to find smiling cats.
Warning

Emoji support varies widely across desktop operating systems, shells, and apps. In recent years the MacOS terminal offers the best support for emojis, followed by modern GNU/Linux graphic terminals. Windows cmd.exe and PowerShell have supported Unicode output since 2018, but as I write this in January 2020, they still don’t display emojis, at least not “out of the box”.

In Example 4-21, note the if statement in the find function using the .issubset() method to quickly test whether all the words in the query set appear in the list of words built from the character’s name. Thanks to Python’s rich set API, we don’t need a nested for loop and another if to implement this test.
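To make the comparison concrete, here is a sketch of the same membership test written both ways, using one character name as sample data. The set method accepts any iterable, so the list returned by split() can be passed directly.

```python
query = {'CAT', 'SMILING'}
name = 'GRINNING CAT FACE WITH SMILING EYES'

# Set-based test: one call, no explicit loop.
with_set_api = query.issubset(name.split())

# Equivalent test spelled out with a loop and a flag.
words = name.split()
with_loop = True
for word in query:
    if word not in words:
        with_loop = False
        break

print(with_set_api, with_loop)
```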

Example 4-21. cf.py: the character finder utility
#!/usr/bin/env python3
import sys
import unicodedata

FIRST, LAST = ord(' '), sys.maxunicode              1


def find(*query_words, first=FIRST, last=LAST):     2
    query = {w.upper() for w in query_words}        3
    count = 0
    for code in range(first, last + 1):
        char = chr(code)                            4
        name = unicodedata.name(char, None)         5
        if name and query.issubset(name.split()):   6
            print(f'U+{code:04X}\t{char}\t{name}')  7
            count += 1
    print(f'({count} found)')


def main(words):
    if words:
        find(*words)
    else:
        print('Please provide words to find.')


if __name__ == '__main__':
    main(sys.argv[1:])
1

Set defaults for first and last code points to search.

2

find takes zero or more query_words, and optional keyword-only arguments to limit the range of the search, for easier testing.

3

Convert the query_words into a set of uppercased strings.

4

Get Unicode character for the code.

5

Get name of character, or None if the code point is unassigned.

6

If there is a name, split it into a list of words, then check that query is a subset of that.

7

Print out line with code point in U+9999 format, the character and its name.
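Because find prints its results, it is awkward to test or reuse. One possible variation, shown here only as a sketch, is a generator that yields the matches and leaves printing to the caller. The name iter_matches is hypothetical; the first and last keyword-only arguments are used to restrict the scan to the Emoticons block.

```python
import sys
import unicodedata


def iter_matches(*query_words, first=ord(' '), last=sys.maxunicode):
    """Yield (code, char, name) for characters whose names have all words."""
    query = {w.upper() for w in query_words}
    for code in range(first, last + 1):
        char = chr(code)
        name = unicodedata.name(char, None)
        if name and query.issubset(name.split()):
            yield code, char, name


# Restrict the scan to U+1F600..U+1F64F to keep the demo fast.
found = list(iter_matches('cat', 'smiling', first=0x1F600, last=0x1F64F))
for code, char, name in found:
    print(f'U+{code:04X}\t{char}\t{name}')
```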

The unicodedata module has other interesting functions. Next we’ll see a few that are related to getting information from characters that have numeric meaning.

Numeric meaning of characters

The unicodedata module includes functions to check whether a Unicode character represents a number and, if so, its numeric value for humans, as opposed to its code point number. Example 4-22 shows the use of unicodedata.name() and unicodedata.numeric() along with the .isdigit() and .isnumeric() methods of str.

Example 4-22. Demo of Unicode database numerical character metadata (callouts describe each column in the output)
import unicodedata
import re

re_digit = re.compile(r'\d')

sample = '1\xbc\xb2\u0969\u136b\u216b\u2466\u2480\u3285'

for char in sample:
    print('U+%04x' % ord(char),                       1
          char.center(6),                             2
          're_dig' if re_digit.match(char) else '-',  3
          'isdig' if char.isdigit() else '-',         4
          'isnum' if char.isnumeric() else '-',       5
          format(unicodedata.numeric(char), '5.2f'),  6
          unicodedata.name(char),                     7
          sep='\t')
1

Code point in U+0000 format.

2

Character centered in a str of length 6.

3

Show re_dig if character matches the r'\d' regex.

4

Show isdig if char.isdigit() is True.

5

Show isnum if char.isnumeric() is True.

6

Numeric value formatted with width 5 and 2 decimal places.

7

Unicode character name.

Running Example 4-22 gives you Figure 4-7, if your terminal font has all those glyphs.

Numeric characters screen shot
Figure 4-7. MacOS terminal showing numeric characters and metadata about them; re_dig means the character matches the regular expression r'\d'

The sixth column of Figure 4-7 is the result of calling unicodedata.numeric(char) on the character. It shows that Unicode knows the numeric value of symbols that represent numbers. So if you want to create a spreadsheet application that supports Tamil digits or Roman numerals, go for it!
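As a sketch of what such an application could do with those values, the snippet below reads the numeric value of each character in the same sample string and adds them up. It also shows that numeric() accepts a default as second argument, returned for characters that have no numeric value instead of raising ValueError.

```python
import unicodedata

sample = '1\xbc\xb2\u0969\u136b\u216b\u2466\u2480\u3285'

# One float per character, taken from the Unicode database.
values = [unicodedata.numeric(char) for char in sample]
print(values)
print(sum(values))

# 'x' has no numeric value, so the default (None) is returned.
print(unicodedata.numeric('x', None))
```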

Figure 4-7 shows that the regular expression r'\d' matches the digit “1” and the Devanagari digit 3, but not some other characters that are considered digits by the isdigit function. The re module is not as savvy about Unicode as it could be. The new regex module available on PyPI was designed to eventually replace re and provides better Unicode support.15 We’ll come back to the re module in the next section.
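The mismatch is easier to reason about when str.isdecimal is added to the comparison. According to the re documentation, a str pattern r'\d' matches characters in the Unicode category Nd (decimal digits), which is the same set isdecimal accepts, while isdigit also accepts characters such as superscripts and circled digits. A short sketch with four sample characters:

```python
import re

re_digit = re.compile(r'\d')

# ASCII 1, Devanagari 3, superscript 2, circled digit 7
for char in ['1', '\u0969', '\xb2', '\u2466']:
    print(f'U+{ord(char):04X}',
          'isdecimal' if char.isdecimal() else '-',
          'isdigit' if char.isdigit() else '-',
          're_dig' if re_digit.match(char) else '-',
          sep='\t')
```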

Throughout this chapter we’ve used several unicodedata functions, but there are many more we did not cover. See the standard library documentation for the unicodedata module.

Next we’ll take a quick look at a new trend: dual-mode APIs offering functions that accept str or bytes arguments with special handling depending on the type.

Dual-Mode str and bytes APIs

Python’s standard library has functions that accept str or bytes arguments and behave differently depending on the type. Some examples are in the re and os modules.

str Versus bytes in Regular Expressions

If you build a regular expression with bytes, patterns such as \d and \w only match ASCII characters; in contrast, if these patterns are given as str, they match Unicode digits or letters beyond ASCII. Example 4-23 and Figure 4-8 compare how letters, ASCII digits, superscripts, and Tamil digits are matched by str and bytes patterns.

Example 4-23. ramanujan.py: compare behavior of simple str and bytes regular expressions
import re

re_numbers_str = re.compile(r'\d+')     1
re_words_str = re.compile(r'\w+')
re_numbers_bytes = re.compile(rb'\d+')  2
re_words_bytes = re.compile(rb'\w+')

text_str = ("Ramanujan saw \u0be7\u0bed\u0be8\u0bef"  3
            " as 1729 = 1³ + 12³ = 9³ + 10³.")        4

text_bytes = text_str.encode('utf_8')  5

print('Text', repr(text_str), sep='\n  ')
print('Numbers')
print('  str  :', re_numbers_str.findall(text_str))      6
print('  bytes:', re_numbers_bytes.findall(text_bytes))  7
print('Words')
print('  str  :', re_words_str.findall(text_str))        8
print('  bytes:', re_words_bytes.findall(text_bytes))    9
1

The first two regular expressions are of the str type.

2

The last two are of the bytes type.

3

Unicode text to search, containing the Tamil digits for 1729 (the logical line continues until the right parenthesis token).

4

This string is joined to the previous one at compile time (see “2.4.2. String literal concatenation” in The Python Language Reference).

5

A bytes string is needed to search with the bytes regular expressions.

6

The str pattern r'\d+' matches the Tamil and ASCII digits.

7

The bytes pattern rb'\d+' matches only the ASCII bytes for digits.

8

The str pattern r'\w+' matches the letters, superscripts, Tamil, and ASCII digits.

9

The bytes pattern rb'\w+' matches only the ASCII bytes for letters and digits.

Output of ramanujan.py
Figure 4-8. Screenshot of running ramanujan.py from Example 4-23

Example 4-23 is a trivial example to make one point: you can use regular expressions on str and bytes, but in the second case bytes outside of the ASCII range are treated as nondigits and nonword characters.

For str regular expressions, there is a re.ASCII flag that makes \w, \W, \b, \B, \d, \D, \s, and \S perform ASCII-only matching. See the documentation of the re module for full details.
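As a minimal sketch of the re.ASCII flag, the snippet below reuses the Tamil digits from Example 4-23 and compiles the same str pattern twice, with and without the flag. The variable names are illustrative and are not part of ramanujan.py.

```python
import re

# Tamil digits for 1729, followed by the same number in ASCII digits.
text = 'Ramanujan saw \u0be7\u0bed\u0be8\u0bef as 1729'

re_unicode = re.compile(r'\d+')           # str pattern: Unicode-aware by default
re_ascii = re.compile(r'\d+', re.ASCII)   # same pattern, restricted to ASCII digits

print(re_unicode.findall(text))  # both the Tamil and the ASCII digit runs
print(re_ascii.findall(text))    # ['1729']
```

With the flag set, a str pattern behaves like the bytes pattern in Example 4-23 as far as \d is concerned, while still searching str text.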

Another important dual-mode module is os.

str Versus bytes in os Functions

The GNU/Linux kernel is not Unicode savvy, so in the real world you may find filenames made of byte sequences that are not valid in any sensible encoding scheme, and cannot be decoded to str. File servers with clients using a variety of OSes are particularly prone to this problem.

In order to work around this issue, all os module functions that accept filenames or pathnames take arguments as str or bytes. If one such function is called with a str argument, the argument will be automatically converted using the codec named by sys.getfilesystemencoding(), and the OS response will be decoded with the same codec. This is almost always what you want, in keeping with the Unicode sandwich best practice.

But if you must deal with (and perhaps fix) filenames that cannot be handled in that way, you can pass bytes arguments to the os functions to get bytes return values. This feature lets you deal with any file or pathname, no matter how many gremlins you may find. See Example 4-24.

Example 4-24. listdir with str and bytes arguments and results
>>> import os
>>> os.listdir('.')  # 1
['abc.txt', 'digits-of-π.txt']
>>> os.listdir(b'.')  # 2
[b'abc.txt', b'digits-of-\xcf\x80.txt']
1

The second filename is “digits-of-π.txt” (with the Greek letter pi).

2

Given a bytes argument, listdir returns filenames as bytes: b'\xcf\x80' is the UTF-8 encoding of the Greek letter pi.

To help with manual handling of str or bytes sequences that are file or pathnames, the os module provides special encoding and decoding functions os.fsencode(name_or_path) and os.fsdecode(name_or_path). Both of these functions accept an argument of type str, bytes, or, since Python 3.6, an object implementing the os.PathLike interface.
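The following sketch shows one way these two functions can be used to round-trip a filename that is not valid in the filesystem encoding. It assumes a POSIX system, where undecodable bytes are preserved through the surrogateescape error handler; the filename b'caf\xe9.txt' is made up for illustration, and the exact str you get back depends on your filesystem encoding.

```python
import os
from pathlib import Path

# Hypothetical filename: Latin-1 bytes that are not valid UTF-8.
raw = b'caf\xe9.txt'

# fsdecode returns a str; on POSIX, bytes that cannot be decoded
# are kept as lone surrogate code points instead of raising an error.
name = os.fsdecode(raw)
print(repr(name))

# fsencode reverses the process and recovers the original bytes.
assert os.fsencode(name) == raw

# Both functions also accept os.PathLike objects such as pathlib.Path.
print(os.fsencode(Path('abc.txt')))  # b'abc.txt'
```

The str returned by fsdecode may not be printable as is, but it can be passed back to os functions or to fsencode without losing information.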

Enough suffering. Let’s wrap up our tour of str versus bytes with a fun topic: building emojis.

Multi-character emojis

As we saw in “Normalizing Unicode for Reliable Comparisons”, it’s always been possible to produce accented characters by combining Unicode letters and diacritics. To accommodate the growing demand for emojis, this idea has been extended to produce different pictographs by combining special markers and emoji characters. Let’s start with the simplest kind of combined emoji: flags of countries.

Country flags

Throughout history, countries split, join, mutate, or simply adopt new flags. The Unicode Consortium found a way to avoid keeping up with those changes and outsource the problem to the systems that claim Unicode support: its character database has no country flags. Instead there is a set of 26 “regional indicator symbol letters”, from A (U+1F1E6) to Z (U+1F1FF). When you combine two of those indicator letters to form an ISO 3166-1 country code, you get the corresponding country flag, if the UI supports it. Example 4-25 shows how.

Example 4-25. two_flags.py: combining regional indicators to produce flags
# REGIONAL INDICATOR SYMBOLS
RIS_A = '\U0001F1E6'  # LETTER A
RIS_U = '\U0001F1FA'  # LETTER U
print(RIS_A + RIS_U)  # AU: Australia
print(RIS_U + RIS_A)  # UA: Ukraine
print(RIS_A + RIS_A)  # AA: no such country

Figure 4-9 shows the output of Example 4-25 on a MacOS 10.14 terminal.

Output of two_flags.py
Figure 4-9. Screenshot of running two_flags.py from Example 4-25. The AA combination is shown as two letters A inside dashed squares.

If your program outputs a combination of indicator letters that is not recognized by the app, you get the indicators displayed as letters inside dashed squares—again, depending on the UI. See the last line in Figure 4-9.
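Because the 26 indicator letters are contiguous, the mapping from a two-letter code to a flag can be computed instead of spelled out. The sketch below assumes a hypothetical helper named flag, which is not part of two_flags.py; it only builds the code point pair and does not check that the code is a real ISO 3166-1 entry.

```python
RIS_BASE = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag(country_code):
    """Return the regional indicator pair for a two-letter code (hypothetical helper)."""
    code = country_code.upper()
    if len(code) != 2 or not all('A' <= char <= 'Z' for char in code):
        raise ValueError(f'expected two ASCII letters, got {country_code!r}')
    # Offset each letter from 'A' into the regional indicator range.
    return ''.join(chr(RIS_BASE + ord(char) - ord('A')) for char in code)


for code in ('AU', 'UA', 'AA'):
    print(code, flag(code))
```

Whether each result is drawn as a flag or as two boxed letters still depends on the UI, as with Example 4-25.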

Note

Europe and the United Nations are not countries, but their flags are supported by the regional indicator pairs EU and UN, respectively. England, Scotland, and Wales may or may not be separate countries by the time you read this, but they also have flags supported by Unicode. However, instead of regional indicator letters, those flags require a more complicated scheme. Read Emoji Flags Explained on Emojipedia to learn how that works.

Now let’s see how emoji modifiers can be used to set the skin tone of emojis that show human faces, hands, noses, etc.

Skin tones

Unicode provides a set of 5 emoji modifiers to set skin tone from pale to dark brown. They are based on the Fitzpatrick scale—developed to study the effects of ultraviolet light on human skin. Example 4-26 shows the use of those modifiers to set the skin tone of the thumbs up emoji.

Example 4-26. skin.py: the thumbs up emoji by itself, followed by all available skin tone modifiers.
from unicodedata import name

SKIN1 = 0x1F3FB  # EMOJI MODIFIER FITZPATRICK TYPE-1-2  # 1
SKINS = [chr(i) for i in range(SKIN1, SKIN1 + 5)]       # 2
THUMB = '\U0001F44d'  # THUMBS UP SIGN 👍

examples = [THUMB]                                      # 3
examples.extend(THUMB + skin for skin in SKINS)         # 4

for example in examples:
    print(example, end='\t')                            # 5
    print(' + '.join(name(char) for char in example))   # 6
1

EMOJI MODIFIER FITZPATRICK TYPE-1-2 is the first modifier.

2

Build list with all five modifiers.

3

Start list with the unmodified THUMBS UP SIGN.

4

Extend list with the same emoji followed by each of the modifiers.

5

Display emoji and tab.

6

Display names of characters combined in the emoji, joined by ' + '.

The output of Example 4-26 looks like Figure 4-10 on MacOS. As you can see, the unmodified emoji has a cartoonish yellow color, while the others have more realistic skin tones.

Thumbs up emoji in 6 colors.
Figure 4-10. Screenshot of Example 4-26 in the MacOS 10.14 terminal.
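One practical consequence of this scheme is that a single pictograph on screen may be two code points in a str. The sketch below illustrates that, and adds a hypothetical helper, strip_skin_tone, that removes the five modifiers from a string; neither the helper nor the constant names come from skin.py.

```python
from unicodedata import name

THUMB = '\U0001F44D'   # THUMBS UP SIGN
MEDIUM = '\U0001F3FD'  # EMOJI MODIFIER FITZPATRICK TYPE-4

thumb_medium = THUMB + MEDIUM
print(thumb_medium, len(thumb_medium))  # one pictograph, two code points
print([name(char) for char in thumb_medium])

# The five skin tone modifiers occupy U+1F3FB through U+1F3FF.
SKIN_MODIFIERS = {chr(i) for i in range(0x1F3FB, 0x1F3FF + 1)}


def strip_skin_tone(text):
    """Return text without skin tone modifiers (hypothetical helper)."""
    return ''.join(char for char in text if char not in SKIN_MODIFIERS)


print(strip_skin_tone(thumb_medium) == THUMB)  # True
```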

Let’s now move to more complex emoji combinations using special markers.

Rainbow flag and other ZWJ sequences

Besides the special purpose indicators and modifiers we’ve seen, Unicode provides a marker that is used as glue between emojis and other characters, to produce new combinations: U+200D, ZERO WIDTH JOINER—a.k.a. ZWJ in many Unicode documents.

For example, the rainbow flag is built by joining the emojis WAVING WHITE FLAG and RAINBOW, as Figure 4-11 shows.

Making rainbow flag in console
Figure 4-11. Making the rainbow flag in the Python console.
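For readers who cannot see the screenshot, the combination can be sketched in text as follows. This is not a transcription of Figure 4-11: the variable names are illustrative. The second form inserts VARIATION SELECTOR-16 after the flag, which is how the sequence is listed in emoji-zwj-sequences.txt.

```python
from unicodedata import name

# Minimal form: flag + ZWJ + rainbow.
rainbow_flag = '\N{WAVING WHITE FLAG}\N{ZERO WIDTH JOINER}\N{RAINBOW}'

# Form with VARIATION SELECTOR-16 requesting emoji presentation of the flag.
rainbow_flag_vs16 = (
    '\N{WAVING WHITE FLAG}\N{VARIATION SELECTOR-16}'
    '\N{ZERO WIDTH JOINER}\N{RAINBOW}'
)

for emoji in (rainbow_flag, rainbow_flag_vs16):
    print(emoji, len(emoji), ' + '.join(name(char) for char in emoji), sep='\t')
```

Whether either string is drawn as a single rainbow flag depends on the terminal or browser font support.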

Unicode 13 supports more than 1100 ZWJ emoji sequences as RGI—“recommended for general interchange […] intended to be widely supported across multiple platforms”.16 You can find the full list of RGI ZWJ emoji sequences in emoji-zwj-sequences.txt and a small sample in Figure 4-12.

Sample ZWJ emoji sequences
Figure 4-12. Sample ZWJ sequences generated by Example 4-27, running in a Jupyter Notebook, viewed on Firefox 72 on Ubuntu 19.10. This browser/OS combo can display all the emojis from this sample, including the newest: “people holding hands” and “transgender flag”, added in Emoji 12.0 and 13.0.

Example 4-27 is the source code that produced Figure 4-12. You can run it from your shell, but for better results I recommend pasting it inside a Jupyter Notebook to run it in a browser. Browsers often lead the way in Unicode support, and provide prettier emoji pictographs.

Example 4-27. zwj_sample.py: produce listing with a few ZWJ sequences.
from unicodedata import name

zwg_sample = """
1F468 200D 1F9B0            |man: red hair                      |E11.0
1F9D1 200D 1F91D 200D 1F9D1 |people holding hands               |E12.0
1F3CA 1F3FF 200D 2640 FE0F  |woman swimming: dark skin tone     |E4.0
1F469 1F3FE 200D 2708 FE0F  |woman pilot: medium-dark skin tone |E4.0
1F468 200D 1F469 200D 1F467 |family: man, woman, girl           |E2.0
1F3F3 FE0F 200D 26A7 FE0F   |transgender flag                   |E13.0
1F469 200D 2764 FE0F 200D 1F48B 200D 1F469 |kiss: woman, woman  |E2.0
"""

markers = {'\u200D': 'ZWG', # ZERO WIDTH JOINER
           '\uFE0F': 'V16', # VARIATION SELECTOR-16
          }

for line in zwg_sample.strip().split('\n'):
    code, descr, version = (s.strip() for s in line.split('|'))
    chars = [chr(int(c, 16)) for c in code.split()]
    print(''.join(chars), version, descr, sep='\t', end='')
    while chars:
        char = chars.pop(0)
        if char in markers:
            print(' + ' + markers[char], end='')
        else:
            ucode = f'U+{ord(char):04X}'
            print(f'\n\t{char}\t{ucode}\t{name(char)}', end='')
    print()

One trend in modern Unicode is the addition of gender-neutral emojis such as SWIMMER (U+1F3CA) or ADULT (U+1F9D1), which can then be shown as they are, or with different gender in ZWJ sequences with the female sign ♀ (U+2640) or the male sign ♂ (U+2642). The Unicode Consortium is also moving towards more diversity in the supported family emojis. Figure 4-13 is a matrix of family emojis showing current support for families with different combinations of parents and children—as of January, 2020.

Matrix of emoji families
Figure 4-13. The table shows adult singles and couples at the top, and boys and girls on the left side. Cells have the combined emoji of a family with the parent(s) from the top and kid(s) from the left. If a combination is not supported by the browser, more than one emoji will appear inside a cell. Firefox 72 on Windows 10 is able to show all combinations.
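The gendered variants mentioned above follow the same pattern as the “woman swimming” entry in Example 4-27: base emoji, ZWJ, gender sign, and VARIATION SELECTOR-16. Here is a minimal sketch, with constant names chosen for illustration:

```python
ZWJ = '\u200D'          # ZERO WIDTH JOINER
VS16 = '\uFE0F'         # VARIATION SELECTOR-16
SWIMMER = '\U0001F3CA'  # SWIMMER, gender-neutral
FEMALE_SIGN = '\u2640'
MALE_SIGN = '\u2642'

woman_swimming = SWIMMER + ZWJ + FEMALE_SIGN + VS16
man_swimming = SWIMMER + ZWJ + MALE_SIGN + VS16

for emoji in (SWIMMER, woman_swimming, man_swimming):
    codes = ' '.join(f'U+{ord(char):04X}' for char in emoji)
    print(emoji, codes, sep='\t')
```

As with the other combinations, a UI that does not support a sequence will show its parts side by side instead of a single pictograph.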

The code I wrote to build Figure 4-13 is mostly concerned with HTML formatting, but is listed in [Link to Come] for completeness.

Example 4-28.

Browsers follow the evolution of Unicode Emoji closely, and here no OS has a clear advantage. While preparing this chapter, I captured Figure 4-12 on Ubuntu 19.10 and Figure 4-13 on Windows 10, using Firefox 72 on both, because those were the OS/browser combinations with the most complete support for the emojis in those examples.

Unicode is a fascinating topic. However, now it is time to wrap up our exploration of str and bytes.

Chapter Summary

We started the chapter by dismissing the notion that 1 character == 1 byte. As the world adopts Unicode, we need to keep the concept of text strings separated from the binary sequences that represent them in files, and Python 3 enforces this separation.

After a brief overview of the binary sequence data types—bytes, bytearray, and memoryview—we jumped into encoding and decoding, with a sampling of important codecs, followed by approaches to prevent or deal with the infamous UnicodeEncodeError, UnicodeDecodeError, and the SyntaxError caused by wrong encoding in Python source files.

While on the subject of source code, I presented my opinion on the debate about non-ASCII identifiers: if the maintainers of the code base want to use a human language that is not limited to ASCII characters, the identifiers should be spelled correctly. That’s precisely why Python 3 accepts non-ASCII identifiers.

We then considered the theory and practice of encoding detection in the absence of metadata: in theory, it can’t be done, but in practice the Chardet package pulls it off pretty well for a number of popular encodings. Byte order marks were then presented as the only encoding hint commonly found in UTF-16 and UTF-32 files—sometimes in UTF-8 files as well.

In the next section, we demonstrated opening text files, an easy task except for one pitfall: the encoding= keyword argument is not mandatory when you open a text file, but it should be. If you fail to specify the encoding, you end up with a program that manages to generate “plain text” that is incompatible across platforms, due to conflicting default encodings. We then exposed the different encoding settings that Python uses as defaults and how to detect them: locale.getpreferredencoding(), sys.getfilesystemencoding(), sys.getdefaultencoding(), and the encodings for the standard I/O files (e.g., sys.stdout.encoding). A sad realization for Windows users is that these settings often have distinct values within the same machine, and the values are mutually incompatible; GNU/Linux and OSX users, in contrast, live in a happier place where UTF-8 is the default pretty much everywhere.

Text comparisons are surprisingly complicated because Unicode provides multiple ways of representing some characters, so normalizing is a prerequisite to text matching. In addition to explaining normalization and case folding, we presented some utility functions that you may adapt to your needs, including drastic transformations like removing all accents. We then saw how to sort Unicode text correctly by leveraging the standard locale module—with some caveats—and an alternative that does not depend on tricky locale configurations: the external PyUCA package.

Then we leveraged the Unicode database to build a command-line utility to search for characters by name, in 28 lines of code, thanks to the power of Python. We glanced at other Unicode metadata, and had a brief overview of dual-mode APIs (e.g., the re and os modules, where some functions can be called with str or bytes arguments, prompting different yet fitting results).

Finally, we saw how to produce flags, hands with different skin tones, family icons and other emoji combinations supported by Unicode.

Further Reading

Ned Batchelder’s 2012 PyCon US talk “Pragmatic Unicode — or — How Do I Stop the Pain?” was outstanding. Ned is so professional that he provides a full transcript of the talk along with the slides and video. Esther Nam and Travis Fischer gave an excellent PyCon 2014 talk “Character encoding and Unicode in Python: How to (╯°□°)╯︵ ┻━┻ with dignity” (slides, video), from which I quoted this chapter’s short and sweet epigraph: “Humans use text. Computers speak bytes.” Lennart Regebro—one of this book’s technical reviewers—presents his “Useful Mental Model of Unicode (UMMU)” in the short post “Unconfusing Unicode: What Is Unicode?”. Unicode is a complex standard, so Lennart’s UMMU is a really useful starting point.

The official Unicode HOWTO in the Python docs approaches the subject from several different angles, from a good historic intro to syntax details, codecs, regular expressions, filenames, and best practices for Unicode-aware I/O (i.e., the Unicode sandwich), with plenty of additional reference links from each section. Chapter 4, “Strings”, of Mark Pilgrim’s awesome book Dive into Python 3 also provides a very good intro to Unicode support in Python 3. In the same book, Chapter 15 describes how the Chardet library was ported from Python 2 to Python 3, a valuable case study given that the switch from the old str to the new bytes is the cause of most migration pains, and that is a central concern in a library designed to detect encodings.

If you know Python 2 but are new to Python 3, Guido van Rossum’s What’s New in Python 3.0 has 15 bullet points that summarize what changed, with lots of links. Guido starts with the blunt statement: “Everything you thought you knew about binary data and Unicode has changed.” Armin Ronacher’s blog post “The Updated Guide to Unicode on Python” is deep and highlights some of the pitfalls of Unicode in Python 3 (Armin is not a big fan of Python 3).

Chapter 2, “Strings and Text,” of the Python Cookbook, Third Edition (O’Reilly), by David Beazley and Brian K. Jones, has several recipes dealing with Unicode normalization, sanitizing text, and performing text-oriented operations on byte sequences. Chapter 5 covers files and I/O, and it includes “Recipe 5.17. Writing Bytes to a Text File,” showing that underlying any text file there is always a binary stream that may be accessed directly when needed. Later in the cookbook, the struct module is put to use in “Recipe 6.11. Reading and Writing Binary Arrays of Structures.”

Nick Coghlan’s Python Notes blog has two posts very relevant to this chapter: “Python 3 and ASCII Compatible Binary Protocols” and “Processing Text Files in Python 3”. Highly recommended.

A list of encodings supported by Python is available at Standard Encodings in the codecs module documentation. If you need to get that list programmatically, see how it’s done in the /Tools/unicode/listcodecs.py script that comes with the CPython source code.

Martijn Faassen’s “Changing the Python Default Encoding Considered Harmful” and Tarek Ziadé’s “sys.setdefaultencoding Is Evil” explain why the default encoding you get from sys.getdefaultencoding() should never be changed, even if you discover how.

The books Unicode Explained by Jukka K. Korpela (O’Reilly) and Unicode Demystified by Richard Gillam (Addison-Wesley) are not Python-specific but were very helpful as I studied Unicode concepts. Programming with Unicode by Victor Stinner is a free, self-published book (Creative Commons BY-SA) covering Unicode in general as well as tools and APIs in the context of the main operating systems and a few programming languages, including Python.

The W3C pages Case Folding: An Introduction and Character Model for the World Wide Web: String Matching and Searching cover normalization concepts, with the former being a gentle introduction and the latter a working group note written in dry standard-speak—the same tone of the Unicode Standard Annex #15 — Unicode Normalization Forms. The Frequently Asked Questions / Normalization from Unicode.org is more readable, as is the NFC FAQ by Mark Davis—author of several Unicode algorithms and president of the Unicode Consortium at the time of this writing. To learn more about Unicode Emoji standards, visit the Unicode Emoji index page, which links to the Technical Standard #51: Unicode Emoji and the emoji data files, where you’ll find emoji-zwj-sequences.txt—the source of the samples I used in Figure 4-12.

Emojipedia is the best site to find emojis and learn about them. Besides a comprehensive searchable database, Emojipedia also has a blog including posts like Emoji ZWJ Sequences: Three Letters, Many Possibilities and Emoji Flags Explained.

In 2016, the Museum of Modern Art (MoMA) in NYC added to its collection The Original Emoji, the 176 emojis designed by Shigetaka Kurita in 1999 for NTT DOCOMO—the Japanese mobile carrier. Going further back in history, Emojipedia published Correcting the Record on the First Emoji Set, crediting Japan’s SoftBank for the earliest known emoji set, deployed in cell phones in 1997. SoftBank’s set is the source of 90 emojis now in Unicode, including U+1F4A9 (PILE OF POO). The culture and politics of emoji evolution in the 2010-2019 decade are the subject of Paddy Johnson’s article Emoji We Lost for Gizmodo. Matthew Rothenberg’s emojitracker.com is a live dashboard showing counts of emoji usage on Twitter, updated in real time. As I write this, FACE WITH TEARS OF JOY (U+1F602) is the most popular emoji on Twitter, with 2,693,102,686 recorded occurrences.

1 Slide 12 of PyCon 2014 talk “Character Encoding and Unicode in Python” (slides, video).

2 Python 2.6 and 2.7 also have bytes, but it’s just an alias to the str type, and does not behave like the Python 3 bytes type.

3 Trivia: the ASCII “single quote” character that Python uses by default as the string delimiter is actually named APOSTROPHE in the Unicode standard. The real single quotes are asymmetric: left is U+2018 and right is U+2019.

4 It did not work in Python 3.0 to 3.4, causing much pain to developers dealing with binary data. The reversal is documented in PEP 461 — Adding % formatting to bytes and bytearray.

5 I first saw the term “Unicode sandwich” in Ned Batchelder’s excellent “Pragmatic Unicode” talk at US PyCon 2012.

6 Python 2.6 or 2.7 users have to use io.open() to get automatic decoding/encoding when reading/writing.

7 Source: Windows Command-Line: Unicode and UTF-8 Output Text Buffer.

8 While researching this subject, I did not find a list of situations when Python 3 internally converts bytes to str. Python core developer Antoine Pitrou says on the comp.python.devel list that CPython internal functions that depend on such conversions “don’t get a lot of use in py3k.”

9 The Python 2 sys.setdefaultencoding function was misused and is no longer documented in Python 3. It was intended for use by the core developers when the internal default encoding of Python was still undecided. In the same comp.python.devel thread, Marc-André Lemburg states that the sys.setdefaultencoding must never be called by user code and the only values supported by CPython are 'ascii' in Python 2 and 'utf-8' in Python 3.

10 Curiously, the micro sign is considered a “compatibility character” but the ohm symbol is not. The end result is that NFC doesn’t touch the micro sign but changes the ohm symbol to capital omega, while NFKC and NFKD change both the ohm and the micro into Greek characters.

11 Diacritics affect sorting only in the rare case when they are the only difference between two words—in that case, the word with a diacritic is sorted after the plain word.

12 Thanks to Leonardo Rochael who went beyond his duties as tech reviewer and researched these Windows details, even though he is a GNU/Linux user himself.

13 Again, I could not find a solution, but did find other people reporting the same problem. Alex Martelli, one of the tech reviewers, had no problem using setlocale and locale.strxfrm on his Mac with OSX 10.9. In summary: your mileage may vary.

14 That’s an image—not a code listing—because emojis are not well supported by O’Reilly’s digital publishing toolchain as I write this.

15 Although it was not better than re at identifying digits in this particular sample.

16 Definition quoted from Technical Standard #51 Unicode Emoji.
