Python: Unicode Tutorial 🐍

By Xah Lee.

Python Source Code Encoding

What's Python source code's default encoding?

For Python 3.x, it's UTF-8.

For Python 2.x, it's ASCII. [see ASCII Table]

Python 2: If your source code contains non-ASCII characters, you must declare the file's encoding on the first or second line, like this:

# -*- coding: utf-8 -*-

# -*- coding: utf-8 -*-
# python 2

x = u"i ♥ cats"

print x

The # -*- coding: utf-8 -*- declaration is a convention adopted from the text editor Emacs. It tells any program reading the file which encoding the file uses.
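To see how such a declaration is read, here is a minimal sketch in Python 3 using the standard library function tokenize.detect_encoding, which looks for the declaration on the first or second line. The helper name declared_encoding and the sample byte strings are made up for illustration.

```python
# python 3
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python would use to read this source code."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding

# hypothetical sample sources, as raw bytes
with_cookie = b"# -*- coding: latin-1 -*-\nx = 1\n"
second_line = b"#!/usr/bin/env python\n# -*- coding: utf-8 -*-\nx = 1\n"
no_cookie = b"x = 1\n"

print(declared_encoding(with_cookie))  # iso-8859-1 (normalized name of latin-1)
print(declared_encoding(second_line))  # utf-8
print(declared_encoding(no_cookie))    # utf-8 (the Python 3 default)
```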

If you don't know unicode, read this first: Unicode Basics: Character Set, Encoding, UTF-8, Codepoint

Unicode in Python 3

Python 3's string is a sequence of unicode characters.

In Python 3, every string is Unicode, and source files are read as UTF-8 by default. You do not need # -*- coding: utf-8 -*-, nor u"…".
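A small sketch of what "sequence of Unicode characters" means in practice: length and indexing count characters, not bytes. The variable names here are only for illustration.

```python
# python 3

s = "i ♥ cats"

print(len(s))  # 8, counted in characters
print(s[2])    # ♥

# encoding to UTF-8 gives a separate bytes object
data = s.encode("utf-8")

print(type(data).__name__)        # bytes
print(len(data))                  # 10, because ♥ takes 3 bytes in UTF-8
print(data.decode("utf-8") == s)  # True
```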

Python 3 supports Unicode in variable and function names.

# python 3

def ƒ(n):
    return n+1

α = 4
print(ƒ(α))
# 5

Note: Unicode characters that are not letters, such as symbols, are not allowed in names.

# python 3

♥ = 4

print(♥)

#     ♥ = 4
#     ^
# SyntaxError: invalid character in identifier
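One way to check a name without triggering a syntax error is the string method isidentifier. A minimal sketch, with sample names chosen for illustration:

```python
# python 3

names = ["ƒ", "α", "♥", "x2", "2x"]

# map each candidate name to whether Python accepts it as an identifier
results = {name: name.isidentifier() for name in names}

for name, ok in results.items():
    print(repr(name), ok)

# 'ƒ' True
# 'α' True
# '♥' False
# 'x2' True
# '2x' False
```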

Python 2: Declare Unicode String

For Python 2, a string that contains Unicode characters must be prefixed with u.

For Python 3, a string literal may also begin with u, for example u"xyz", but the prefix has no effect. Every string is already a Unicode datatype.

# -*- coding: utf-8 -*-
# python 2

aa = u"I ♥ U"  # unicode string starts with “u”

print aa # I ♥ U

The u makes the string a Unicode datatype. Without the u, the string is just a byte sequence.

Sometimes when you print Unicode strings, you may get an error like this:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 16: ordinal not in range(128).

The solution is to convert explicitly: use .encode() to turn a Unicode string into bytes before printing, or .decode() to turn bytes into a Unicode string.

# -*- coding: utf-8 -*-
# python 2

myStr = u'α'

# Bad. This raises UnicodeEncodeError when the output encoding is ASCII,
# for example when the output is piped to a file.
print 'Greek alpha: ', myStr

# Good
print 'Greek alpha: ', myStr.encode('utf-8')
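The same error can be reproduced in Python 3 by encoding to ASCII explicitly. The following is a sketch of the encode and decode round trip, not the only way to handle it. The errors="replace" option is one of the standard error handlers.

```python
# python 3

text = "Greek alpha: α"

# encoding to ASCII fails, because α is outside the ASCII range
try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)  # cannot encode: ordinal not in range(128)

# UTF-8 can represent every Unicode character
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)                          # b'Greek alpha: \xce\xb1'
print(utf8_bytes.decode("utf-8") == text)  # True

# or substitute a question mark for characters that cannot be encoded
replaced = text.encode("ascii", errors="replace")
print(replaced)  # b'Greek alpha: ?'
```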

Unicode Escape Sequence in String

Both Python 3 and Python 2 can have Unicode characters literally in a string, for example "i ♥ u" or u"i ♥ u".

You can also use escape sequences. There are 2 forms:

\u4_digits_hex
For a character whose Unicode codepoint fits in 4 hexadecimal digits. If the codepoint has fewer than 4 digits, pad with 0 in front.
\U8_digits_hex
For a character whose Unicode codepoint needs more than 4 hexadecimal digits. Pad with 0 in front to make a total of 8 digits.

# python 3

# BLACK HEART SUIT, hexadecimal 2665
x = "♥"
y = "\u2665"

print(x == y)
# True

# unicode escape sequence, for char with more than 4 hexadecimal digits

# GRINNING CAT FACE WITH SMILING EYES, hexadecimal 1f638
x = "😸"
y = "\U0001f638"

print(x == y)
# True

# -*- coding: utf-8 -*-
# python 2

# BLACK HEART SUIT, hexadecimal 2665
x = u"♥"
y = u"\u2665"

print x == y
# True

# unicode escape sequence, for char with more than 4 hexadecimal digits

# GRINNING CAT FACE WITH SMILING EYES, hexadecimal 1f638
x = u"😸"
y = u"\U0001f638"

print x == y
# True
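The padding rule for the two forms can be sketched as a small helper that goes from a character to its escape sequence. The function name to_escape is made up for this example.

```python
# python 3

def to_escape(ch):
    """Return the \\u or \\U escape sequence for a single character."""
    codepoint = ord(ch)
    if codepoint <= 0xFFFF:
        # fits in 4 hexadecimal digits, padded with 0 in front
        return "\\u{:04x}".format(codepoint)
    # otherwise use the 8 digit form, padded with 0 in front
    return "\\U{:08x}".format(codepoint)

print(to_escape("♥"))  # \u2665
print(to_escape("😸"))  # \U0001f638
print(to_escape("A"))  # \u0041
```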

Python 2: Unicode in Regex

When using regex on a Unicode string, if you want the word patterns {\w, \W} and boundary patterns {\b, \B} to depend on Unicode character properties, add the Unicode flag re.U when calling regex functions.

# -*- coding: utf-8 -*-
# python 2

# example showing the difference of using re.U regex flag

import re

rr = re.findall(r"\w+", u"♥αβγ!", re.U)

if rr:
    print rr
else:
    print "no match"

# prints [u'\u03b1\u03b2\u03b3']

# if re.U is not used, it prints 「no match」 because the 「\w+」 pattern for “word” only considers ASCII letters

See: Python Regex Flags .
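For comparison, in Python 3 the situation is reversed: patterns on str are Unicode-aware by default, and the re.ASCII flag restricts \w to ASCII. A minimal sketch:

```python
# python 3

import re

text = "♥αβγ!"

# default for str patterns: \w matches Unicode letters
unicode_words = re.findall(r"\w+", text)

# with re.ASCII, \w matches only [a-zA-Z0-9_]
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)  # ['αβγ']
print(ascii_words)    # []
```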

Find Replace Unicode Char in String

# -*- coding: utf-8 -*-
# python 2

# example of finding all non-ASCII chars in a string

import re

ss = u"i♥NY 😸"

# find all non-ASCII chars (ASCII is \u0000 to \u007f)
myResult = re.findall(u"[^\u0000-\u007f]+", ss)

if myResult:
    print myResult  # [u'\u2665', u'\U0001f638']
else:
    print "no match"
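A Python 3 sketch of both finding and replacing, assuming that "Unicode char" here means any character outside ASCII. The replacement format U+XXXX is just one choice for illustration.

```python
# python 3

import re

ss = "i♥NY 😸"

# matches any single character outside the ASCII range
non_ascii = re.compile(r"[^\x00-\x7f]")

# find
found = non_ascii.findall(ss)
print(found)  # ['♥', '😸']

# replace each match by its codepoint in U+ notation
replaced = non_ascii.sub(lambda m: "U+{:04X}".format(ord(m.group())), ss)
print(replaced)  # iU+2665NY U+1F638
```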

Find Unicode Character's Name, Codepoint, Properties

Python: Get Unicode Name, Codepoint

Convert File Encoding

Python: Convert File Encoding

Unicode Characters Search

Unicode Search 😄

[see Unicode Basics: Character Set, Encoding, UTF-8]
