Detecting and converting character encodings

A common occurrence with text processing is finding text that has a non-standard character encoding. Ideally, all text would be ASCII or UTF-8, but that's just not the reality. In cases when you have non-ASCII or non-UTF-8 text and you don't know what the character encoding is, you'll need to detect it and convert the text to a standard encoding before further processing it.

Getting ready

You'll need to install the chardet module, using sudo pip install chardet or sudo easy_install chardet. You can learn more about chardet at http://chardet.feedparser.org/.

How to do it...

Encoding detection and conversion functions are provided in encoding.py. These are simple wrapper functions around the chardet module. To detect the encoding of a string, call encoding.detect(). You'll get back a dict containing two keys: confidence and encoding. confidence is chardet's estimate, between 0 and 1, of how likely it is that the value for encoding is correct.

# -*- coding: utf-8 -*-
import chardet

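def detect(s):
  # If chardet raises UnicodeDecodeError, encode the string as UTF-8 and retry.
  try:
    return chardet.detect(s)
  except UnicodeDecodeError:
    return chardet.detect(s.encode('utf-8'))

def convert(s):
  encoding = detect(s)['encoding']

  # Decode from the detected encoding into a unicode object.
  if encoding == 'utf-8':
    return unicode(s)
  else:
    return unicode(s, encoding)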

Here's some example code using detect() to determine character encoding:

>>> import encoding
>>> encoding.detect('ascii')
{'confidence': 1.0, 'encoding': 'ascii'}
>>> encoding.detect(u'abcdé')
{'confidence': 0.75249999999999995, 'encoding': 'utf-8'}
>>> encoding.detect('\222\222\223\225')
{'confidence': 0.5, 'encoding': 'windows-1252'}
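
Since the confidence value varies so much across these results, a caller may want to check it before trusting the reported encoding. The following is a minimal sketch in Python 3 syntax and is not part of encoding.py: the function name trusted_encoding and the 0.6 threshold are illustrative assumptions. It works only on a result dict of the shape shown above, so it never calls chardet itself.

```python
def trusted_encoding(result, min_confidence=0.6):
    """Return the detected encoding, or None if the guess looks too weak.

    `result` is a dict with 'encoding' and 'confidence' keys, shaped like
    the detection results shown above. The default threshold is an
    arbitrary illustrative value, not a chardet recommendation.
    """
    encoding = result.get('encoding')
    confidence = result.get('confidence') or 0.0
    if encoding is None or confidence < min_confidence:
        return None
    return encoding


print(trusted_encoding({'confidence': 1.0, 'encoding': 'ascii'}))         # ascii
print(trusted_encoding({'confidence': 0.5, 'encoding': 'windows-1252'}))  # None
```

Returning None leaves the decision to the caller, who might fall back to a default encoding or flag the text for manual review. Note that a low-confidence guess is not necessarily a wrong one, so the threshold is a trade-off rather than a guarantee.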

To convert a string to unicode, call encoding.convert(). This decodes the string from its detected encoding into a unicode object, which can then be re-encoded in a standard encoding such as UTF-8.

>>> encoding.convert('ascii')
u'ascii'
>>> encoding.convert(u'abcdé')
u'abcd\xc3\xa9'
>>> encoding.convert('\222\222\223\225')
u'\u2019\u2019\u201c\u2022'

How it works...

The detect() function is a wrapper around chardet.detect() that also handles UnicodeDecodeError exceptions. When one is raised, the string is encoded as UTF-8 before detection is retried.

The convert() function first calls detect() to get the encoding, then calls unicode() with that encoding as the second argument. Passing the encoding into unicode() decodes the string from its original encoding into a unicode object, which can then be re-encoded into a standard encoding such as UTF-8.
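
In Python 3, where unicode() no longer exists, the same decode-then-re-encode idea can be sketched with bytes.decode() and str.encode(). This is an illustration under stated assumptions rather than the module's code: to_utf8 and its fallback parameter are hypothetical names, and the encoding is passed in directly instead of coming from detect().

```python
def to_utf8(raw, encoding, fallback='utf-8'):
    """Decode `raw` bytes from `encoding`, then re-encode them as UTF-8.

    `encoding` is whatever name a detector reported. If it is None,
    `fallback` is used instead (an assumption made for this sketch).
    Undecodable bytes are replaced with U+FFFD instead of raising.
    """
    text = raw.decode(encoding or fallback, errors='replace')
    return text.encode('utf-8')


# Bytes 0x92 0x92 0x93 0x95 in windows-1252 are two right single
# quotation marks, a left double quotation mark, and a bullet.
converted = to_utf8(b'\x92\x92\x93\x95', 'windows-1252')
print(converted.decode('utf-8') == '\u2019\u2019\u201c\u2022')  # True
```

Keeping the decoded text as a str for as long as possible, and encoding only at output time, avoids repeated round trips between encodings.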

There's more...

The comment at the top of the module, # -*- coding: utf-8 -*-, is a hint to the Python interpreter, telling it which encoding to use for the strings in the code. This is helpful for when you have non-ASCII strings in your source code, and is documented in detail at http://www.python.org/dev/peps/pep-0263/.

Converting to ASCII

If you want pure ASCII text, with non-ASCII characters converted to ASCII equivalents, or dropped if there is no equivalent character, then you can use the unicodedata.normalize() function.

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'abcd\xe9').encode('ascii', 'ignore')
'abcde'

Specifying 'NFKD' as the first argument decomposes each character into its compatibility-equivalent form, so an accented letter such as é becomes a plain e followed by a separate combining accent. The final call to encode() with 'ignore' as the second argument then drops every character that has no ASCII representation, including those combining marks.
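
The two steps can be seen separately in the following sketch, written in Python 3 syntax. The helper name to_ascii is hypothetical and not part of the recipe's module; it only wraps the unicodedata.normalize() and encode() calls described above.

```python
import unicodedata


def to_ascii(text):
    """Best-effort ASCII version of `text` (hypothetical helper)."""
    # NFKD splits an accented character into its base letter plus
    # separate combining marks, e.g. U+00E9 becomes 'e' + U+0301.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding with 'ignore' drops every code point that is not ASCII,
    # which removes the combining marks left by the decomposition.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii('abcd\xe9'))   # abcde
print(to_ascii('it\u2019s'))  # its
```

The second call shows the lossy side of this approach: the right single quotation mark has no ASCII decomposition, so it is dropped rather than replaced with an apostrophe.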
