Detecting and converting character encodings

Text processing often turns up text in a non-standard character encoding. Ideally, all text would be ASCII or UTF-8, but that's just not the reality. When you have text that is neither ASCII nor UTF-8 and you don't know its character encoding, you'll need to detect the encoding and convert the text to a standard encoding before processing it further.

Getting ready

You'll need to install the chardet module, using sudo pip install chardet or sudo easy_install chardet. You can learn more about chardet at http://chardet.feedparser.org/.

How to do it...

Encoding detection and conversion functions are provided in encoding.py. These are simple wrapper functions around the chardet module. To detect the encoding of a string, call encoding.detect(). You'll get back a dict containing two keys: confidence and encoding. confidence is a value between 0 and 1 that expresses how sure chardet is that the value for encoding is correct.

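# -*- coding: utf-8 -*-
import chardet

def detect(s):
  # Returns a dict with 'confidence' and 'encoding' keys.
  try:
    return chardet.detect(s)
  except UnicodeDecodeError:
    # Encode as UTF-8 first, then retry detection.
    return chardet.detect(s.encode('utf-8'))

def convert(s):
  # Decode s from its detected encoding into a unicode object.
  encoding = detect(s)['encoding']

  if encoding == 'utf-8':
    return unicode(s)
  else:
    return unicode(s, encoding)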

Here's some example code using detect() to determine character encoding:

>>> import encoding
>>> encoding.detect('ascii')
{'confidence': 1.0, 'encoding': 'ascii'}
>>> encoding.detect(u'abcdé')
{'confidence': 0.75249999999999995, 'encoding': 'utf-8'}
>>> encoding.detect('\222\222\223\225')
{'confidence': 0.5, 'encoding': 'windows-1252'}
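Since confidence varies widely across results like the ones shown above, you may want to act on it before trusting the detected encoding. The following is a minimal sketch and not part of encoding.py: the trusted_encoding() name and the 0.6 default threshold are hypothetical choices made for illustration. It works on a result dict, so it does not call chardet itself, and it should run under both Python 2 and Python 3.

```python
def trusted_encoding(result, min_confidence=0.6):
    """Return the detected encoding, or None if it is too uncertain.

    result is a dict with 'encoding' and 'confidence' keys, shaped like
    the dicts returned by detect(). The threshold is an assumption.
    """
    encoding = result.get('encoding')
    confidence = result.get('confidence') or 0.0
    # The encoding can be None when nothing could be detected.
    if encoding is None or confidence < min_confidence:
        return None
    return encoding

print(trusted_encoding({'confidence': 1.0, 'encoding': 'ascii'}))
print(trusted_encoding({'confidence': 0.5, 'encoding': 'windows-1252'}))
```

With the default threshold, the first call returns 'ascii' and the second returns None, which leaves the caller free to skip the text, try a fallback, or flag it for review.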

To convert a string to unicode, call encoding.convert(). This decodes the string from its detected encoding and returns a unicode object, which you can then re-encode as UTF-8 or any other standard encoding.

>>> encoding.convert('ascii')
u'ascii'
>>> encoding.convert(u'abcdé')
u'abcd\xc3\xa9'
>>> encoding.convert('\222\222\223\225')
u'\u2019\u2019\u201c\u2022'

How it works...

The detect() function is a wrapper around chardet.detect() which can handle UnicodeDecodeError exceptions. In these cases, the string is encoded in UTF-8 before trying to detect the encoding.

The convert() function first calls detect() to get the encoding, then builds a unicode string by calling unicode() with the detected encoding as the second argument. Passing the encoding into unicode() decodes the string from its original encoding, so that it can then be re-encoded into a standard encoding.
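The decode-then-re-encode idea can be sketched without chardet when the source encoding is already known. This is an illustration under stated assumptions, not the module's code: the to_utf8() name is hypothetical, and it uses the decode() and encode() methods, which exist in both Python 2 and Python 3, in place of the unicode() call.

```python
def to_utf8(raw, source_encoding):
    """Decode raw bytes from source_encoding, then re-encode as UTF-8."""
    text = raw.decode(source_encoding)  # bytes -> unicode text
    return text.encode('utf-8')         # unicode text -> UTF-8 bytes

# Four bytes that are punctuation characters in windows-1252:
# two right single quotes, a left double quote, and a bullet.
raw = b'\x92\x92\x93\x95'
utf8_bytes = to_utf8(raw, 'windows-1252')
print(repr(utf8_bytes))
```

Each of the four input bytes becomes a three-byte UTF-8 sequence, so the result is twelve bytes long.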

There's more...

The comment at the top of the module, # -*- coding: utf-8 -*-, is a hint to the Python interpreter, telling it which encoding to use for the strings in the code. This is helpful for when you have non-ASCII strings in your source code, and is documented in detail at http://www.python.org/dev/peps/pep-0263/.

Converting to ASCII

If you want pure ASCII text, with non-ASCII characters converted to ASCII equivalents, or dropped if there is no equivalent character, then you can use the unicodedata.normalize() function.

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'abcd\xe9').encode('ascii', 'ignore')
'abcde'

Specifying 'NFKD' as the first argument decomposes each character into its compatibility form, so an accented character such as é becomes a plain e followed by a separate combining accent. The final call to encode() with 'ignore' as the second argument then drops every character that cannot be encoded as ASCII, including the combining accent.
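The two steps can be wrapped in a small helper. This is a minimal sketch: the to_ascii() name is hypothetical and not part of encoding.py, and the code is written to run under both Python 2 and Python 3 (where the result is a bytes object).

```python
import unicodedata

def to_ascii(text):
    """Return an ASCII-only approximation of a unicode string."""
    # NFKD splits an accented letter into a base letter plus a
    # separate combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # 'ignore' drops every code point that has no ASCII encoding,
    # which includes the combining marks.
    return decomposed.encode('ascii', 'ignore')

print(to_ascii(u'abcd\xe9'))
print(to_ascii(u'caf\xe9 \u2022 na\xefve'))
```

In the second call the accents are stripped from both words, while the bullet character (U+2022) has no ASCII equivalent and is dropped entirely, leaving two adjacent spaces.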

See also

Encoding detection and conversion is a recommended first step before doing HTML processing with lxml or BeautifulSoup, covered in the Extracting URLs from HTML with lxml and Converting HTML entities with BeautifulSoup recipes.
