Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

The unicodedata Module

(New in 2.0) The unicodedata module contains Unicode character properties, such as character categories, decomposition data, and numerical values. Its use is shown in Example 8-3.

Example 8-3. Using the unicodedata Module

File: unicodedata-example-1.py

import unicodedata

for char in [u"A", u"-", u"1", u"N{LATIN CAPITAL LETTER O WITH DIAERESIS}"]:
    print repr(char),
    print unicodedata.category(char),
    print repr(unicodedata.decomposition(char)),

    print unicodedata.decimal(char, None),
    print unicodedata.numeric(char, None)

u'A' Lu '' None None
u'-' Pd '' None None
u'1' Nd '' 1 1.0
u'303226' Lu '004F 0308' None None

Note that in Python 2.0, properties for CJK ideographs and Hangul syllables are missing. This affects characters in the range 0x3400-0x4DB5, 0x4E00-0x9FA5, and 0xAC00-D7A3. The first character in each range has correct properties, so you can work around this problem by simply mapping each character to the beginning:

def remap(char):
    # fix for broken unicode property database in Python 2.0
    c = ord(char)
    if 0x3400 <= c <= 0x4DB5:
        return unichr(0x3400)
    if 0x4E00 <= c <= 0x9FA5:
        return unichr(0x4E00)
    if 0xAC00 <= c <= 0xD7A3:
        return unichr(0xAC00)
    return char

This bug has been fixed in Python 2.1.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for The unicodedata Module

Create new playlist

Sign In

Sign Up

The unicodedata Module

Table of Contents for
The unicodedata Module