Python: Get Unicode Name, Codepoint

By Xah Lee.

Get Codepoint

Get a Unicode character's codepoint with the built-in function ord.

# python 3

# get codepoint of a Unicode char, in decimal
print(ord("→"))
# 8594
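
Codepoints are usually written in hexadecimal, as in U+2192. Here is a minimal sketch of converting both ways, using only Python built-ins. The variable names are just for illustration.

```python
# python 3

arrow = "→"
cp = ord(arrow)

# decimal codepoint
print(cp)
# 8594

# the same codepoint in the usual U+XXXX notation
print("U+%04X" % cp)
# U+2192

# chr is the inverse of ord: codepoint back to char
print(chr(0x2192))
# →
```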

Get Name

Find a character's Unicode name.

# python 3

from unicodedata import *

print(name(u"→"))
# RIGHTWARDS ARROW
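
Not every char has a name. For example, control chars such as newline have none, and name raises ValueError for them unless a default is passed as the second argument. A small sketch follows. The default string "(no name)" is arbitrary.

```python
# python 3

from unicodedata import name

# newline is a control char, it has no Unicode name
print(name("\n", "(no name)"))
# (no name)

# without a default, name raises ValueError
try:
    name("\n")
except ValueError as err:
    print("ValueError:", err)
```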

Get Char

Get the Unicode char of a given name.

# python 3

from unicodedata import *

char1 = lookup("GREEK SMALL LETTER ALPHA")
print(char1)
# α

char2 = lookup("RIGHTWARDS ARROW")
print(char2)
# →

char3 = lookup("CJK UNIFIED IDEOGRAPH-5929")
print(char3)
# 天
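
lookup raises KeyError when no char has the given name. Python 3 string literals also accept a char name directly, via the \N{...} escape. Here is a short sketch. The name "NO SUCH CHARACTER NAME" is deliberately invalid.

```python
# python 3

from unicodedata import lookup

# \N{...} in a string literal gives the same char as lookup
print("\N{RIGHTWARDS ARROW}" == lookup("RIGHTWARDS ARROW"))
# True

# unknown name raises KeyError
try:
    lookup("NO SUCH CHARACTER NAME")
except KeyError:
    print("no char has that name")
# no char has that name
```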

Here's the Python 2 version:

# -*- coding: utf-8 -*-
# python 2

from unicodedata import *

char1 = lookup("GREEK SMALL LETTER ALPHA")
print(char1.encode('utf-8'))
# α

char2 = lookup("RIGHTWARDS ARROW")
print(char2.encode('utf-8'))
# →

char3 = lookup("CJK UNIFIED IDEOGRAPH-5929")
print(char3.encode('utf-8'))
# 天

Intro to Unicode:

  1. Each char has an ID, called its codepoint. It's an integer.
  2. Each char has a unique name. (But a char may also have an older name.)
  3. Each char has a number of properties, for example: upper or lower case, direction (for right-to-left languages), whether it's a combining char, whether it's punctuation, etc.

The rest of the functions in the unicodedata module return these properties.
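
As a rough illustration of item 3, a few of the property functions in unicodedata are category, bidirectional, and combining. The sample chars are chosen just for this example.

```python
# python 3

from unicodedata import category, bidirectional, combining

# general category: Lu is uppercase letter, Ll is lowercase letter
print(category("A"), category("a"))
# Lu Ll

# Po is "punctuation, other", Sm is "symbol, math"
print(category("!"), category("→"))
# Po Sm

# bidirectional class: L is left-to-right, R is right-to-left
# \u05d0 is HEBREW LETTER ALEF
print(bidirectional("a"), bidirectional("\u05d0"))
# L R

# canonical combining class: 0 means not a combining char
# \u0301 is COMBINING ACUTE ACCENT
print(combining("a"), combining("\u0301"))
# 0 230
```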

[see Unicode Basics: Character Set, Encoding, UTF-8]

This page lets you search Unicode: Unicode Search 😄

Print a Range of Unicode Chars

Here's an example that prints a range of Unicode chars, with their codepoint in hex and their name.

Chars without a name are skipped. (Some of these are unassigned codepoints.)

# python 3

from unicodedata import *

# codepoints 945 to 968, α to ψ
xlist = [chr(i) for i in range(945, 969)]

for x in xlist:
    # name(x, '-') returns '-' when the char has no name
    if name(x, '-') != '-':
        print(x, '|', "%04x" % ord(x), '|', name(x, '-'))

# output
# α | 03b1 | GREEK SMALL LETTER ALPHA
# β | 03b2 | GREEK SMALL LETTER BETA
# γ | 03b3 | GREEK SMALL LETTER GAMMA
# δ | 03b4 | GREEK SMALL LETTER DELTA
# ε | 03b5 | GREEK SMALL LETTER EPSILON
# ζ | 03b6 | GREEK SMALL LETTER ZETA
# η | 03b7 | GREEK SMALL LETTER ETA
# θ | 03b8 | GREEK SMALL LETTER THETA
# ι | 03b9 | GREEK SMALL LETTER IOTA
# κ | 03ba | GREEK SMALL LETTER KAPPA
# λ | 03bb | GREEK SMALL LETTER LAMDA
# μ | 03bc | GREEK SMALL LETTER MU
# ν | 03bd | GREEK SMALL LETTER NU
# ξ | 03be | GREEK SMALL LETTER XI
# ο | 03bf | GREEK SMALL LETTER OMICRON
# π | 03c0 | GREEK SMALL LETTER PI
# ρ | 03c1 | GREEK SMALL LETTER RHO
# ς | 03c2 | GREEK SMALL LETTER FINAL SIGMA
# σ | 03c3 | GREEK SMALL LETTER SIGMA
# τ | 03c4 | GREEK SMALL LETTER TAU
# υ | 03c5 | GREEK SMALL LETTER UPSILON
# φ | 03c6 | GREEK SMALL LETTER PHI
# χ | 03c7 | GREEK SMALL LETTER CHI
# ψ | 03c8 | GREEK SMALL LETTER PSI
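
The same idea can be wrapped in a function. This is only a sketch, and the name unicode_rows is made up for this example. It builds each char with chr, which also handles codepoints above U+FFFF.

```python
# python 3

from unicodedata import name

def unicode_rows(start, end):
    """Return a list of (char, hex codepoint, name) for the named chars
    with codepoint from start to end, inclusive."""
    rows = []
    for i in range(start, end + 1):
        ch = chr(i)
        nm = name(ch, '')  # '' when the char has no name
        if nm:
            rows.append((ch, "%04x" % i, nm))
    return rows

for ch, cp, nm in unicode_rows(0x1F600, 0x1F602):
    print(ch, '|', cp, '|', nm)

# output
# 😀 | 1f600 | GRINNING FACE
# 😁 | 1f601 | GRINNING FACE WITH SMILING EYES
# 😂 | 1f602 | FACE WITH TEARS OF JOY
```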
