It's all Unicode

In the days of Python 2, the text data type str was used to support ASCII data, and for Unicode data, the language provided a unicode data type. When someone wanted to deal with a particular encoding, they took a string and encoded it into the required encoding scheme.

Also, the language inherently supported an implicit conversion of the string type to the unicode type. This is shown in the following code snippet:

str1 = 'Hello'
type(str1)        # type(str1) => 'str'
str2 = u'World'
type(str2)        # type(str2) => 'unicode'
str3 = str1 + str2
type(str3)        # type(str3) => 'unicode'

This used to work, because here, Python would implicitly decode the byte string str1 into Unicode using the default encoding and then perform a concatenation. One thing to note here is that if this str1 string contained any non-ASCII characters, then this concatenation would have failed in Python, raising a UnicodeDecodeError.

With the arrival of Python 3, the data types that dealt with text changed. Now, the default data type str which was used to store text supports Unicode. With this, Python 3 also introduced a binary data type, called bytes, which can be used to store binary data. These two types, str and bytes, are incompatible and no implicit conversion between them will happen, and any attempt to do so will give rise to TypeError, as shown in the following code:

str1 = 'I am a unicode string'
type(str1) # type(str1) => 'str'
str2 = b"And I can't be concatenated to a byte string"
type(str2) # type(str2) => 'bytes'
str3 = str1 + str2
-----------------------------------------------------------
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat str to bytes

As we can see, an attempt to concatenate a unicode type string with a byte type string failed with TypeError. Although an implicit conversion of a string to a byte or a byte to a string is not possible, we do have methods that allow us to encode a string into a bytes type and decode a bytes type to a string. Look at the following code:

str1 = '₹100'
str1.encode('utf-8')
#b'xe2x82xb9100'
b'xe2x82xb9100'.decode('utf-8')
# '₹100'

This clear distinction between a string type and binary type with restrictions on implicit conversion allows for more robust code and fewer errors. But these changes also mean that any code that used to deal with the handling of Unicode in Python 2 will need to be rewritten in Python 3 because of the backward incompatibility.

Here, you should focus on the encoding and decoding format used to convert string to bytes and vice versa. Choosing a different formatting for encoding and decoding can result in the loss of important information, and can result in corrupt data.

Table of Contents for It's all Unicode

Create new playlist

Sign In

Sign Up

Table of Contents for
It's all Unicode