Python: Unicode Tutorial 🐍

By Xah Lee. Date: . Last updated: .

Unicode Characters in Source Code

What is Python source code's default encoding?

For Python 2.x, it's ASCII. [see ASCII Table]

For Python 3.x, it's UTF-8.

Python 2: If your source code contains Unicode characters, you must declare the file's encoding in the first line or second line, Like this:

-*- coding: utf-8 -*-

#-*- coding: utf-8 -*-
# python 2

x = u"i β™₯ cats"

print x

The #-*- coding: utf-8 -*- declaration in the first line is a convention adopted from the text editor Emacs. [see Emacs and Unicode Tips] It tells any program reading the file that the file is encoded using a particular encoding. [see HTML: Character Sets and Encoding] This convention is also used by Ruby. [see Ruby: Unicode Tutorial πŸ’Ž]

For a basic introduction to Unicode, see: Unicode Basics: What's Character Set, Character Encoding, UTF-8?

Python 2: Declare Unicode String

For Python 2, strings that contain Unicode characters must start with u in front of the string.

#-*- coding: utf-8 -*-
# python 2

aa = u"I β™₯ U"      # unicode string starts with β€œu”

print aa           # I β™₯ U

The u makes the string a Unicode datatype. Without the u, string is just byte sequence.

Sometimes when you print Unicode strings, you may get a error like this:

# UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 16: ordinal not in range(128).

The solution is to use the .encode() or .decode() method.

#-*- coding: utf-8 -*-
# python 2

myStr = u'Ξ±'

# Bad. This is a error.
print 'Greek alpha: ', myStr

# Good
print 'Greek alpha: ', myStr.encode('utf-8')

Unicode Escape Sequence in String

You can have a unicode character literally in a string, such as u'i β™₯ u'.

You can also use escape sequences. There are 2 forms:

# -*- coding: utf-8 -*-
# python 2

x = u"β™₯"             # BLACK HEART SUIT, hex 2665
y = u"\u2665"

print x == y                     # True
# -*- coding: utf-8 -*-
# python 2

# unicode escape sequence, for char with more than 4 hex digits

x = u"😸"      # name: GRINNING CAT FACE WITH SMILING EYES, hex 1f638
y = u"\U0001f638"

print x == y                     # True

2. Lexical analysis β€” Python v2.7.6 documentation#string-literals

Unicode in Regex

When using regex on Unicode string, and you want the word patterns {\w, \W} and boundary patterns {\b, \B}, dependent on the Unicode character properties, you need to add the Unicode flag re.U when calling regex functions.

# -*- coding: utf-8 -*-
# python 2

# example showing the difference of using re.U regex flag

import re

rr = re.findall(r"\w+", u"β™₯Ξ±Ξ²Ξ³!", re.U)

if rr:
    print rr
    print "no match"

# prints [u'\u03b1\u03b2\u03b3']

# if re.U is not used, it prints γ€Œno match」 because the γ€Œ\w+」 pattern for β€œword” only consider ASCII letters

See: Python Regex Flags.

Find/Replace Unicode Char in String

# -*- coding: utf-8 -*-
# python 2

# example of finding all unicode char in a string

import re

ss = u"iβ™₯NY 😸"

# find all unicode chars
myResult = re.findall(u"[^\u0000-\u007e]+", ss)

if myResult:
    print myResult              # [u'\u2665', u'\U0001f638']
    print "no match"

Find/Replace Strings in Unicode Files

Python: Find Replace Strings in Unicode Files

Find Unicode Character's Name, Codepoint, Properties

See: Python: Process Unicode, unicodedata Module.

Convert File's Encoding

See: Python: Convert File Encoding.

Unicode in Python 3

In Python 3, everything is Unicode (UTF-8). You do not need # -*- coding: utf-8 -*-, nor u"…".

Python 3 supports Unicode in variable and function names.

# -*- coding: utf-8 -*- ← optional, but still good to indicate encoding
# python 3

def Ζ’(n):
    return n+1

Ξ± = 4
print(Ζ’(Ξ±))                     # prints 5

Note, unicode that are not letters are not allowed.

# -*- coding: utf-8 -*-
# python 3

β™₯ = 4


#     β™₯ = 4
#     ^
# SyntaxError: invalid character in identifier

Unicode Characters Search

Unicode Characters βˆ‘ β™₯ πŸ˜„

[see Unicode Basics: What's Character Set, Character Encoding, UTF-8?]

If you have a question, put $5 at patreon and message me.