Python: Processing Unicode: unicodedata Module Tutorial


How to get a Unicode character's code point in decimal?

#-*- coding: utf-8 -*-
# python 2

# ord() is a built-in function; nothing from unicodedata is needed here

# get code point of Unicode char (in decimal)
print ord(u"→")                 # 8594
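Code points are usually written in hex, as in U+2192. Here's a small sketch that converts a char to that notation and back. The helper name codepoint_label is made up for illustration, and the code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# runs on python 2.7 and python 3
from __future__ import print_function

def codepoint_label(char):
    """Return the U+XXXX notation for a single char (hypothetical helper)."""
    return "U+%04X" % ord(char)

arrow = u"\u2192"                   # RIGHTWARDS ARROW, written as an escape

print(ord(arrow))                   # 8594
print(codepoint_label(arrow))       # U+2192
print(int("2192", 16))              # 8594, hex string back to decimal
```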

Reference: Python v2.7.6 documentation, section 7.9, unicodedata (Unicode Database)

How to find a character's Unicode name?

#-*- coding: utf-8 -*-
# python 2

from unicodedata import *

# find Unicode char's name
print name(u"→")                # RIGHTWARDS ARROW
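Not every char has a name. Control characters such as newline do not, and name() raises ValueError for them unless you pass a default as the second argument. A minimal sketch of both behaviors (the "(no name)" placeholder is just an arbitrary default chosen here):

```python
# -*- coding: utf-8 -*-
# runs on python 2.7 and python 3
from __future__ import print_function
from unicodedata import name

newline = u"\n"                     # a control character; it has no name

print(name(u"\u2192"))              # RIGHTWARDS ARROW
print(name(newline, "(no name)"))   # (no name)

# without a default, a nameless char raises ValueError
try:
    name(newline)
    raised = False
except ValueError:
    raised = True
print(raised)                       # True
```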

How to get the Unicode char of a given name?

#-*- coding: utf-8 -*-
# python 2

from unicodedata import *

char1 = lookup("GREEK SMALL LETTER ALPHA") # should be UPPER CASE
print char1.encode('utf-8')     # α

char2 = lookup("RIGHTWARDS ARROW") # should be UPPER CASE
print char2.encode('utf-8')     # →

char3 = lookup("CJK UNIFIED IDEOGRAPH-5929") # doesn't work if not UPPER CASE
print char3.encode('utf-8')     # 天
# Unicode name is UPPER CASE by spec
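When the name is not in the database, lookup() raises KeyError. Here's a sketch of a wrapper that returns None instead, plus a round trip through name(). The function safe_lookup is a hypothetical helper, not part of unicodedata.

```python
# -*- coding: utf-8 -*-
# runs on python 2.7 and python 3
from __future__ import print_function
from unicodedata import lookup, name

def safe_lookup(char_name):
    """Return the char for a Unicode name, or None if the name is unknown."""
    try:
        return lookup(char_name)
    except KeyError:
        return None

alpha = safe_lookup("GREEK SMALL LETTER ALPHA")
print(ord(alpha))                               # 945
print(name(alpha))                              # GREEK SMALL LETTER ALPHA
print(safe_lookup("NO SUCH CHARACTER NAME"))    # None
```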

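Besides its name, each Unicode character has other properties, such as general category, bidirectional class, combining class, East Asian width, numeric value, and decomposition mapping. The rest of the functions in the unicodedata module return these properties.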
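Here's a sketch of a few of those property functions on a handful of sample chars. The choice of samples is arbitrary; category(), bidirectional(), numeric() and decomposition() are functions of the unicodedata module.

```python
# -*- coding: utf-8 -*-
# runs on python 2.7 and python 3
from __future__ import print_function
import unicodedata

# sample chars: A, α, 5, →, é (written as escapes)
samples = [u"A", u"\u03b1", u"5", u"\u2192", u"\u00e9"]

for ch in samples:
    print("U+%04X" % ord(ch),
          unicodedata.category(ch),             # general category, e.g. Lu, Nd, Sm
          unicodedata.bidirectional(ch),        # bidirectional class, e.g. L, EN
          unicodedata.numeric(ch, None),        # numeric value, None if it has none
          unicodedata.decomposition(ch) or "-") # decomposition mapping, "-" if empty
```

For example, "A" has category Lu (uppercase letter), "5" has numeric value 5.0, and "é" decomposes to 0065 0301 (e followed by a combining acute accent).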

Print a Range of Unicode Chars

Here's a snippet that prints a range of Unicode chars, along with each char's code point in hex and its name.

Chars without a name are skipped. (Some of these are unassigned code points.)

# -*- coding: utf-8 -*-
# python 2

from unicodedata import *

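# code points 945 to 968 (U+03B1 to U+03C8); unichr() builds the char directly, no eval needed
xlist = [unichr(i) for i in range(945, 969)]

for x in xlist:
    # name(x, default) returns the default for chars that have no name
    xname = name(x, None)
    if xname is not None:
        print x.encode('utf-8'), '|', "%04x" % ord(x), '|', xname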

# output
"""
α | 03b1 | GREEK SMALL LETTER ALPHA
β | 03b2 | GREEK SMALL LETTER BETA
γ | 03b3 | GREEK SMALL LETTER GAMMA
δ | 03b4 | GREEK SMALL LETTER DELTA
ε | 03b5 | GREEK SMALL LETTER EPSILON
ζ | 03b6 | GREEK SMALL LETTER ZETA
η | 03b7 | GREEK SMALL LETTER ETA
θ | 03b8 | GREEK SMALL LETTER THETA
ι | 03b9 | GREEK SMALL LETTER IOTA
κ | 03ba | GREEK SMALL LETTER KAPPA
λ | 03bb | GREEK SMALL LETTER LAMDA
μ | 03bc | GREEK SMALL LETTER MU
ν | 03bd | GREEK SMALL LETTER NU
ξ | 03be | GREEK SMALL LETTER XI
ο | 03bf | GREEK SMALL LETTER OMICRON
π | 03c0 | GREEK SMALL LETTER PI
ρ | 03c1 | GREEK SMALL LETTER RHO
ς | 03c2 | GREEK SMALL LETTER FINAL SIGMA
σ | 03c3 | GREEK SMALL LETTER SIGMA
τ | 03c4 | GREEK SMALL LETTER TAU
υ | 03c5 | GREEK SMALL LETTER UPSILON
φ | 03c6 | GREEK SMALL LETTER PHI
χ | 03c7 | GREEK SMALL LETTER CHI
ψ | 03c8 | GREEK SMALL LETTER PSI
"""
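The same idea can be wrapped in a reusable function that takes the range as hex code points. This is only a sketch: named_chars is a made-up name, the unichr fallback is there so the code also runs on Python 3, and it prints just the hex code point and name to stay clear of terminal encoding issues.

```python
# -*- coding: utf-8 -*-
# runs on python 2.7 and python 3
from __future__ import print_function
from unicodedata import name

try:
    unichr              # python 2
except NameError:
    unichr = chr        # python 3 has no unichr; chr does the same job

def named_chars(first, last):
    """Yield (hex code point, name) for each named char in first..last, inclusive."""
    for i in range(first, last + 1):
        char_name = name(unichr(i), None)
        if char_name is not None:       # skip chars without a name
            yield "%04x" % i, char_name

rows = list(named_chars(0x03b1, 0x03c9))    # α through ω
for hex_code, char_name in rows:
    print(hex_code, '|', char_name)
```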

Print Unicode Symbols Whose Name Contains STAR

Here's an example by 馬曉駿 https://gist.github.com/10622337

# -*- coding: utf-8 -*-
# python 2

# print all unicode chars whose name contains "STAR"
# 2014-04-14 by 馬曉駿 https://gist.github.com/10622337

from unicodedata import name

bullets = []

for i in range(0x10000):
    c = unichr(i)
    # name(c, '') returns '' for chars without a name instead of raising ValueError
    if 'STAR' in name(c, ''):
        bullets.append(c)

# sort by Unicode name
bullets.sort(key=name)
for c in bullets:
    print name(c), c.encode("utf-8")

# output
"""
APL FUNCTIONAL SYMBOL CIRCLE STAR ⍟
APL FUNCTIONAL SYMBOL STAR DIAERESIS ⍣
ARABIC FIVE POINTED STAR ٭
ARABIC START OF RUB EL HIZB ۞
BLACK CENTRE WHITE STAR ✬
BLACK FOUR POINTED STAR ✦
BLACK SMALL STAR ⭑
BLACK STAR ★
CIRCLED OPEN CENTRE EIGHT POINTED STAR ❂
CIRCLED WHITE STAR ✪
EIGHT POINTED BLACK STAR ✴
EIGHT POINTED PINWHEEL STAR ✵
EIGHT POINTED RECTILINEAR BLACK STAR ✷
GLEICH STARK ⧦
HEAVY EIGHT POINTED RECTILINEAR BLACK STAR ✸
HEAVY OUTLINED BLACK STAR ✮
OPEN CENTRE BLACK STAR ✫
OUTLINED BLACK STAR ✭
OUTLINED WHITE STAR ⚝
PINWHEEL STAR ✯
SHADOWED WHITE STAR ✰
SIX POINTED BLACK STAR ✶
STAR AND CRESCENT ☪
STAR EQUALS ≛
STAR OF DAVID ✡
STAR OPERATOR ⋆
STRESS OUTLINED WHITE STAR ✩
SYMBOL FOR START OF HEADING ␁
SYMBOL FOR START OF TEXT ␂
TIBETAN MARK DELIMITER TSHEG BSTAR ༌
TWELVE POINTED BLACK STAR ✹
WHITE FOUR POINTED STAR ✧
WHITE MEDIUM STAR ⭐
WHITE SMALL STAR ⭒
WHITE STAR ☆
"""
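Note that the substring test also picks up names such as SYMBOL FOR START OF TEXT and GLEICH STARK. If you only want names that contain STAR as a whole word, one possible variant is to split the name into words first. This is a sketch, not part of the original gist; chars_with_word is a hypothetical helper, and the unichr fallback is there so it also runs on Python 3.

```python
# -*- coding: utf-8 -*-
# runs on python 2.7 and python 3
from __future__ import print_function
from unicodedata import name

try:
    unichr              # python 2
except NameError:
    unichr = chr        # python 3

def chars_with_word(word, limit=0x10000):
    """Return sorted (name, char) pairs whose name contains word as a whole word."""
    found = []
    for i in range(limit):
        char = unichr(i)
        char_name = name(char, '')          # '' for chars without a name
        if word in char_name.split():       # compare against whole words only
            found.append((char_name, char))
    found.sort()
    return found

stars = chars_with_word('STAR')
names = [char_name for char_name, char in stars]

print('BLACK STAR' in names)                # True
print('SYMBOL FOR START OF TEXT' in names)  # False
print('GLEICH STARK' in names)              # False
```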

For an Emacs solution that needs no Emacs Lisp, see: Emacs Keyboard Macro Example: Insert All Unicode Bullets
