Python's unicodedata Module

Xah Lee

Python has a unicodedata module. Here's an example:

# -*- coding: utf-8 -*-
# python 2

from unicodedata import lookup, name

# each Unicode char has a unique name.
# one can use the “lookup” function to find a char by its name

mychar = lookup('greek cApital letter sIgma')  # note: letter case doesn't matter here
print mychar.encode('utf-8')

m = lookup('CJK UNIFIED IDEOGRAPH-5929')  # for CJK ideograph names, the letter case must be right
print m.encode('utf-8')

# to find a char's name, use the “name” function
print name(u'天')

# to get the code point of a Unicode char in decimal, use the builtin function ord
print ord(u'天')
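
The same lookups can be sketched in Python 3 syntax (an assumption, since the code above targets Python 2). The helper name `describe` and the placeholder string `<no name>` are made up for this illustration. The sketch also shows what happens when a lookup fails: `lookup` raises KeyError for an unknown name, and `name` raises ValueError for an unnamed char unless a default is passed.

```python
import unicodedata

def describe(ch):
    """Return (hex code point, name) for a single character."""
    # Passing a default avoids the ValueError raised for unnamed chars.
    return "%04X" % ord(ch), unicodedata.name(ch, "<no name>")

sigma = unicodedata.lookup("greek capital letter sigma")  # case does not matter
tian = unicodedata.lookup("CJK UNIFIED IDEOGRAPH-5929")

print(describe(sigma))
print(describe(tian))
print(describe("\n"))  # control chars have no name

try:
    unicodedata.lookup("no such character name")
except KeyError as err:
    print("lookup failed:", err)
```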

Basically, in Unicode, each char has a number of attributes called properties. The char's name is one of its properties. These properties provide the info needed to form letters, words, and sentences, and to do processing such as sorting and capitalization. For example, letters in the English alphabet have two forms, upper case and lower case. Given a char, you need to know which form it is, and what the corresponding other form is. Korean letters are stacked together into syllable blocks. Many symbols correspond to numbers, such as the circled digit “①”. There are also combining forms, used for example to put a bar over any letter or character. Also, some writing systems are directional. To form these symbols for display, or to process them in computing, this info is needed for each char.

The rest of the functions in unicodedata return these attributes.
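
Here is a minimal sketch of a few of those functions, again assuming Python 3. The helper `properties` and the choice of sample chars are made up for this illustration. `category`, `numeric`, `combining`, `bidirectional`, and `normalize` are functions of the unicodedata module.

```python
import unicodedata

def properties(ch):
    """Collect a few Unicode properties of a single character."""
    return {
        "name": unicodedata.name(ch, None),
        "category": unicodedata.category(ch),            # e.g. Lu, Ll, No, Mn
        "numeric": unicodedata.numeric(ch, None),        # None if not a number
        "combining": unicodedata.combining(ch),          # 0 means not a combining char
        "bidirectional": unicodedata.bidirectional(ch),  # e.g. L, R, NSM
    }

# upper case letter, circled digit one, combining macron, Hebrew letter alef
for ch in ["A", "\u2460", "\u0304", "\u05d0"]:
    print("U+%04X" % ord(ch), properties(ch))

# Canonical decomposition splits a Korean syllable block into its letters,
# and composition merges a base letter with a combining bar.
jamo = unicodedata.normalize("NFD", "\ud55c")    # the syllable 한
a_bar = unicodedata.normalize("NFC", "a\u0304")  # a + combining macron
print(["U+%04X" % ord(c) for c in jamo], "U+%04X" % ord(a_bar))
```

The circled digit reports a numeric value of 1.0, the combining macron has a nonzero combining class, and the Hebrew letter has bidirectional class R (right to left).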

http://docs.python.org/lib/module-unicodedata.html

Official doc on Unicode character properties: http://www.unicode.org/uni2book/ch04.pdf

Print a Range of Unicode Chars

Here's a snippet of code that prints a range of Unicode chars, along with each char's code point in hex and its name.

Chars without a name are skipped. (Some of these are unassigned code points.)

On Microsoft Windows the encoding might need to be changed to utf-16.

Change the range to see different Unicode chars.

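# -*- coding: utf-8 -*-
# python 2

from unicodedata import name

# range() excludes its upper bound, so this covers U+0000 through U+0FFE
for i in range(0x0000, 0x0fff):
    x = unichr(i)  # build the char directly, no eval needed
    charname = name(x, '-')  # '-' is returned when the char has no name
    if charname != '-':
        print x.encode('utf-8'), '|', "%04x" % i, '|', charname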
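
For Python 3 (an assumption, since the snippet above is Python 2), the same idea can be sketched as a generator. The name `named_chars` is made up for this illustration. The guarded `reconfigure` call is one possible way to force UTF-8 output on consoles that default to another encoding. It exists on the standard text streams in Python 3.7 and later.

```python
import sys
import unicodedata

def named_chars(start, stop):
    """Yield (char, hex code point, name) for each named char in [start, stop)."""
    for code in range(start, stop):
        ch = chr(code)
        char_name = unicodedata.name(ch, None)  # None when the char has no name
        if char_name is not None:
            yield ch, "%04x" % code, char_name

# Ask for UTF-8 output where the stream supports it.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")

# a small Greek range; change the bounds to see different chars
for ch, hex_code, char_name in named_chars(0x03b1, 0x03b6):
    print(ch, "|", hex_code, "|", char_name)
```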