Python: Get Unicode Name, Codepoint
Get Codepoint from Char
Get a Unicode character's codepoint.
# get codepoint of Unicode char in decimal
# ord is a built-in function, no import needed
print(ord("→")) # 8594
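Codepoints are usually written in hexadecimal, as in U+2192. Here is a minimal sketch of converting both ways using only built-ins. The helper name codepoint_label is made up for illustration and is not part of any library.

```python
def codepoint_label(char):
    """Return the U+XXXX notation for a single character."""
    return "U+%04X" % ord(char)

print(codepoint_label("→"))  # U+2192

# chr is the inverse of ord: it turns a codepoint back into a char
print(chr(0x2192))  # →
print(chr(8594) == "→")  # True
```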
Get Name from Char
Find a character's Unicode name.
from unicodedata import name

print(name("→")) # RIGHTWARDS ARROW
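Some chars have no name in the unicodedata module, for example control chars such as newline. The sketch below shows what name does in that case, and how an optional second argument gives a fallback value. The fallback string "(unnamed)" is an arbitrary choice for illustration.

```python
from unicodedata import name

# name() raises ValueError when a char has no name and no default is given
try:
    name("\n")
except ValueError:
    print("newline has no name")

# pass a second argument to get a fallback value instead of an error
print(name("\n", "(unnamed)"))  # (unnamed)
print(name("a"))  # LATIN SMALL LETTER A
```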
Get Char from Name
Get the Unicode char of a given name.
from unicodedata import lookup

char1 = lookup("GREEK SMALL LETTER ALPHA")
print(char1) # α

char2 = lookup("RIGHTWARDS ARROW")
print(char2) # →

char3 = lookup("CJK UNIFIED IDEOGRAPH-5929")
print(char3) # 天
Here's the Python 2 version:
# -*- coding: utf-8 -*-
# python 2
from unicodedata import lookup

char1 = lookup("GREEK SMALL LETTER ALPHA")
print(char1.encode('utf-8')) # α

char2 = lookup("RIGHTWARDS ARROW")
print(char2.encode('utf-8')) # →

char3 = lookup("CJK UNIFIED IDEOGRAPH-5929")
print(char3.encode('utf-8')) # 天
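In Python 3, two related details are worth knowing: lookup raises KeyError when the name does not exist, and a string literal can embed a char by name with the \N{...} escape. A short sketch follows. The bogus name below is made up to trigger the error.

```python
from unicodedata import lookup

# lookup() raises KeyError when the name is unknown
try:
    lookup("NO SUCH CHARACTER NAME")
except KeyError:
    print("name not found")

# the \N{...} escape embeds a char by name directly in a string literal
arrow = "\N{RIGHTWARDS ARROW}"
print(arrow)  # →
print(arrow == lookup("RIGHTWARDS ARROW"))  # True
```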
Intro to Unicode and UTF-8:
- Each char has an ID, called its codepoint. It's an integer.
- Each char has a unique name. (A char may also have an older name.)
- Each char has a number of properties, for example: upper/lower case, direction (for right-to-left languages), whether it's a combining char, whether it's punctuation, etc.
The rest of the functions in the unicodedata module return these properties.
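As a rough illustration of the properties listed above, the sketch below queries three of them with unicodedata.category, unicodedata.bidirectional, and unicodedata.combining. The sample chars are an arbitrary selection.

```python
import unicodedata

# sample chars: Latin upper and lower case, a digit, punctuation,
# HEBREW LETTER ALEF, and COMBINING ACUTE ACCENT
samples = ["A", "a", "5", "!", "\u05d0", "\u0301"]

for ch in samples:
    print(
        "U+%04X" % ord(ch),
        unicodedata.category(ch),       # general category, e.g. Lu = uppercase letter
        unicodedata.bidirectional(ch),  # bidirectional class, e.g. R = right-to-left
        unicodedata.combining(ch),      # canonical combining class, 0 = not combining
    )
```

For example, the category of "A" is Lu (uppercase letter), the bidirectional class of Hebrew alef is R, and the combining class of the combining acute accent is 230.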
[see Unicode Basics: Character Set, Encoding, UTF-8 ]
This page lets you search unicode. Unicode Search 😄
Print a Range of Unicode Chars
Here's an example that prints a range of Unicode chars, with their codepoint in hex, and name.
Chars without a name are skipped. (Some of these are unassigned codepoints.)
from unicodedata import name

# codepoints 945 to 968 are the Greek lowercase letters α to ψ
for i in range(945, 969):
    x = chr(i)
    # skip any char that has no name
    if name(x, '-') != '-':
        print(x, '|', "%04x" % ord(x), '|', name(x, '-'))

# output
# α | 03b1 | GREEK SMALL LETTER ALPHA
# β | 03b2 | GREEK SMALL LETTER BETA
# γ | 03b3 | GREEK SMALL LETTER GAMMA
# δ | 03b4 | GREEK SMALL LETTER DELTA
# ε | 03b5 | GREEK SMALL LETTER EPSILON
# ζ | 03b6 | GREEK SMALL LETTER ZETA
# η | 03b7 | GREEK SMALL LETTER ETA
# θ | 03b8 | GREEK SMALL LETTER THETA
# ι | 03b9 | GREEK SMALL LETTER IOTA
# κ | 03ba | GREEK SMALL LETTER KAPPA
# λ | 03bb | GREEK SMALL LETTER LAMDA
# μ | 03bc | GREEK SMALL LETTER MU
# ν | 03bd | GREEK SMALL LETTER NU
# ξ | 03be | GREEK SMALL LETTER XI
# ο | 03bf | GREEK SMALL LETTER OMICRON
# π | 03c0 | GREEK SMALL LETTER PI
# ρ | 03c1 | GREEK SMALL LETTER RHO
# ς | 03c2 | GREEK SMALL LETTER FINAL SIGMA
# σ | 03c3 | GREEK SMALL LETTER SIGMA
# τ | 03c4 | GREEK SMALL LETTER TAU
# υ | 03c5 | GREEK SMALL LETTER UPSILON
# φ | 03c6 | GREEK SMALL LETTER PHI
# χ | 03c7 | GREEK SMALL LETTER CHI
# ψ | 03c8 | GREEK SMALL LETTER PSI
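The same idea can be wrapped in a reusable generator, so that any codepoint range can be listed. This is only a sketch. The function name named_chars is hypothetical, and it uses None rather than '-' as the marker for a missing name.

```python
from unicodedata import name

def named_chars(start, stop):
    """Yield (char, hex codepoint, name) for each named codepoint in range(start, stop)."""
    for i in range(start, stop):
        char = chr(i)
        char_name = name(char, None)  # None when the char has no name
        if char_name is not None:
            yield char, "%04x" % i, char_name

# print the first three Greek lowercase letters
for char, hexcode, char_name in named_chars(0x3b1, 0x3b4):
    print(char, "|", hexcode, "|", char_name)
```

Ranges that contain only unnamed chars, such as the ASCII control chars from 0 to 31, yield nothing.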