A common occurrence with text processing is finding text that has a non-standard character encoding. Ideally, all text would be ASCII or UTF-8, but that's just not the reality. In cases when you have non-ASCII or non-UTF-8 text and you don't know what the character encoding is, you'll need to detect it and convert the text to a standard encoding before further processing it.
You'll need to install the chardet module, using sudo pip install chardet or sudo easy_install chardet. You can learn more about chardet at http://chardet.feedparser.org/.
Encoding detection and conversion functions are provided in encoding.py. These are simple wrapper functions around the chardet module. To detect the encoding of a string, call encoding.detect(). You'll get back a dict containing two keys: confidence and encoding. confidence is a value between 0 and 1 indicating how confident chardet is that the value for encoding is correct.
# -*- coding: utf-8 -*-
import chardet

def detect(s):
    try:
        return chardet.detect(s)
    except UnicodeDecodeError:
        # Encode the string as UTF-8 first, then try detection again.
        return chardet.detect(s.encode('utf-8'))

def convert(s):
    encoding = detect(s)['encoding']

    if encoding == 'utf-8':
        return unicode(s)
    else:
        # Decode from the detected encoding into a unicode string.
        return unicode(s, encoding)
Here's some example code using detect() to determine character encoding:
>>> import encoding
>>> encoding.detect('ascii')
{'confidence': 1.0, 'encoding': 'ascii'}
>>> encoding.detect(u'abcdé')
{'confidence': 0.75249999999999995, 'encoding': 'utf-8'}
>>> encoding.detect('\222\222\223\225')
{'confidence': 0.5, 'encoding': 'windows-1252'}
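Because confidence is a value between 0 and 1, one way to use it is to accept a guess only when it clears some threshold. The following is a minimal sketch, not part of encoding.py: pick_encoding(), min_confidence, and fallback are hypothetical names chosen for illustration, and the 0.6 cutoff is an arbitrary assumption. It works on any dict with the two keys shown above, so it does not need chardet installed.

```python
def pick_encoding(result, min_confidence=0.6, fallback="utf-8"):
    """Return the detected encoding only if the detector is confident enough.

    `result` is a dict with 'encoding' and 'confidence' keys, shaped like
    the ones shown above. `min_confidence` and `fallback` are hypothetical
    parameters for this sketch, not part of chardet.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


print(pick_encoding({"confidence": 1.0, "encoding": "ascii"}))         # ascii
print(pick_encoding({"confidence": 0.5, "encoding": "windows-1252"}))  # utf-8
```

Whether falling back to UTF-8 is the right choice depends on where the text comes from; for some sources it may be better to reject low-confidence input outright.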
To convert a string to a standard unicode encoding, call encoding.convert(). This will decode the string from its original encoding into a unicode string, which can then be re-encoded as UTF-8.
>>> encoding.convert('ascii')
u'ascii'
>>> encoding.convert(u'abcdé')
u'abcd\xc3\xa9'
>>> encoding.convert('\222\222\223\225')
u'\u2019\u2019\u201c\u2022'
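The decode-then-re-encode step can be seen in isolation using only the standard library. This sketch is written for Python 3, where bytes and text are separate types (an assumption; the sessions above are Python 2), and reencode_as_utf8() is a hypothetical helper name. It takes the source encoding as given rather than detecting it.

```python
def reencode_as_utf8(raw, source_encoding):
    """Decode bytes from a known encoding, then encode the text as UTF-8."""
    text = raw.decode(source_encoding)
    return text, text.encode("utf-8")


# The same four bytes as '\222\222\223\225' in the sessions above.
raw = b"\x92\x92\x93\x95"
text, utf8_bytes = reencode_as_utf8(raw, "windows-1252")

print(ascii(text))  # '\u2019\u2019\u201c\u2022'
print(utf8_bytes)   # b'\xe2\x80\x99\xe2\x80\x99\xe2\x80\x9c\xe2\x80\xa2'
```

Each of the four single-byte characters becomes three bytes in UTF-8, which is why the two byte strings differ even though they represent the same text.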
The detect() function is a wrapper around chardet.detect() which can handle UnicodeDecodeError exceptions. In these cases, the string is encoded in UTF-8 before trying to detect the encoding.
The convert() function first calls detect() to get the encoding, then returns a unicode string, passing the encoding as the second argument to unicode(). By passing the encoding into unicode(), the string is decoded from the original encoding, allowing it to be re-encoded into a standard encoding.
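For readers on Python 3, where unicode() no longer exists, the two wrappers might look roughly like the sketch below. This is an adaptation under stated assumptions, not the original encoding.py: the detector parameter is a hypothetical addition that lets a stand-in function replace chardet.detect, and falling back to UTF-8 when no encoding is reported is also an assumption.

```python
def detect(data, detector=None):
    """Guess the encoding of `data`, which may be bytes or str."""
    if detector is None:
        import chardet  # third-party; imported only when actually needed
        detector = chardet.detect
    if isinstance(data, str):
        # Text has no byte encoding yet, so encode it as UTF-8 first,
        # mirroring the fallback in the original detect().
        data = data.encode("utf-8")
    return detector(data)


def convert(data, detector=None):
    """Return `data` as a str, decoding bytes with the detected encoding."""
    if isinstance(data, str):
        return data  # already decoded text
    # Assumption: fall back to UTF-8 if the detector reports no encoding.
    encoding = detect(data, detector)["encoding"] or "utf-8"
    return data.decode(encoding)
```

Calling convert(raw_bytes) with no detector uses chardet.detect; passing a function that returns a dict with an encoding key makes the wrappers easy to exercise without chardet installed.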
The comment at the top of the module, # -*- coding: utf-8 -*-, is a hint to the Python interpreter, telling it which encoding to use for the strings in the code. This is helpful for when you have non-ASCII strings in your source code, and is documented in detail at http://www.python.org/dev/peps/pep-0263/.
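To see how such a comment is read, Python 3's standard tokenize module exposes detect_encoding(), which applies the PEP 263 rules to the first lines of a source file. The sketch below assumes Python 3; declared_encoding() and the two sample sources are hypothetical names made up for illustration.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding per PEP 263 (UTF-8 if none is declared)."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


utf8_source = b"# -*- coding: utf-8 -*-\ns = 'abcd\xc3\xa9'\n"
latin1_source = b"# -*- coding: latin-1 -*-\ns = 'abcd\xe9'\n"

print(declared_encoding(utf8_source))    # utf-8
print(declared_encoding(latin1_source))  # iso-8859-1
```

Note that detect_encoding() reports a normalized name, so latin-1 comes back as iso-8859-1.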
If you want pure ASCII text, with non-ASCII characters converted to ASCII equivalents, or dropped if there is no equivalent character, then you can use the unicodedata.normalize() function.
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'abcd\xe9').encode('ascii', 'ignore')
'abcde'
Specifying 'NFKD' as the first argument decomposes each character into its compatibility-equivalent form, so an accented letter becomes its ASCII base letter followed by a combining mark. The final call to encode() with 'ignore' as the second argument then drops the combining marks, along with any other characters that have no ASCII equivalent.
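The same technique can be wrapped in a small helper. This is a sketch for Python 3 (an assumption; the session above is Python 2), and to_ascii() is a hypothetical name. Since encode() returns bytes in Python 3, the result is decoded back to str at the end.

```python
import unicodedata


def to_ascii(text):
    """Reduce text to ASCII, dropping characters with no ASCII equivalent."""
    # NFKD splits 'é' into 'e' plus a combining accent (U+0301).
    decomposed = unicodedata.normalize("NFKD", text)
    # 'ignore' silently drops everything that is not ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("abcd\xe9"))   # abcde
print(to_ascii("it\u2019s"))  # its
```

The second call shows the lossy side of this approach: the curly apostrophe has no decomposition, so it is removed rather than replaced with a plain one.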