Python Unicode Tutorial 🐍


If you are not familiar with Unicode, first see: Unicode Basics: What's Character Set, Character Encoding, UTF-8?

Unicode Characters in Source Code

What's Python source code's default encoding?

For Python 2.x, it's ASCII.

For Python 3.x, it's UTF-8.

2. Lexical analysis — Python v2.7.6 documentation

http://legacy.python.org/dev/peps/pep-0263/

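If your Python 2 source code contains non-ASCII characters, you must declare the file's encoding with a comment such as # -*- coding: utf-8 -*- on the first or second line, and save the file in that encoding. In Python 3, the declaration is needed only when the file is saved in an encoding other than UTF-8. Like this: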

#-*- coding: utf-8 -*-
# python 2
u"I ♥ Cats"

The # -*- coding: utf-8 -*- declaration is a convention adopted from the text editor Emacs. 〔➤ Emacs and Unicode Tips〕 It tells any program reading the file which character encoding the file uses. Its purpose is similar to HTML's <meta http-equiv="content-type" content="text/html; charset=utf-8" />. (See: Character Sets and Encoding in HTML; Unicode Basics.) This convention is also used by Ruby. 〔➤ Ruby Unicode Tutorial 💎〕
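To see how such a declaration is read, here is a small Python 3 sketch that asks the standard library's tokenize.detect_encoding which encoding a piece of source declares. The helper name declared_encoding and the sample byte strings are made up for illustration; only tokenize.detect_encoding and io.BytesIO come from the standard library.

```python
# Python 3 sketch: find out which encoding a source file declares.
import io
import tokenize


def declared_encoding(source_bytes):
    """Hypothetical helper: return the encoding Python would use for this source."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


# Sample sources (made up for this example).
with_declaration = b"# -*- coding: utf-8 -*-\nx = 1\n"
no_declaration = b"x = 1\n"
second_line = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nx = 1\n"

print(declared_encoding(with_declaration))  # utf-8
print(declared_encoding(no_declaration))    # utf-8 (the Python 3 default)
print(declared_encoding(second_line))       # iso-8859-1 (normalized name for latin-1)
```

Note that the declaration is found on the second line too, which matches the "first line or second line" rule.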

Text Processing with Unicode Strings

In Python 2, a string literal that contains Unicode characters should have the prefix u in front of the string. Example:

#-*- coding: utf-8 -*-
# python 2

aa = u"I ♥ U"     # unicode string, start with “u” or “ur”

print(aa)         # I ♥ U

The u prefix makes the string a unicode object. Without the u, a Python 2 string literal is just a sequence of bytes.
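The same text-versus-bytes distinction can be sketched in Python 3, where the two types are named str and bytes. This is only an illustration, and the variable names text and data are made up for it.

```python
# Python 3 sketch: a text string (str) versus its UTF-8 encoded bytes.
text = "I ♥ U"
data = text.encode("utf-8")

print(type(text).__name__, len(text))  # str 5    (5 characters)
print(type(data).__name__, len(data))  # bytes 7  (♥ takes 3 bytes in UTF-8)
print(data)                            # b'I \xe2\x99\xa5 U'
print(data.decode("utf-8") == text)    # True
```

The length differs because len() counts characters for text but counts bytes for encoded data.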

Sometimes when you print Unicode strings, you may get an error like this:

# UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 16: ordinal not in range(128).

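The solution is to convert explicitly: use the .encode() method to turn a Unicode string into bytes before printing, or the .decode() method to turn bytes into a Unicode string. Example: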

#-*- coding: utf-8 -*-
# python 2

myStr = u'α'

# Bad. Raises UnicodeEncodeError when the output encoding is ASCII (for example, when output is piped).
print 'Greek alpha: ', myStr

# Good
print 'Greek alpha: ', myStr.encode('utf-8')
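For comparison, here is a Python 3 sketch of what goes wrong when a string is encoded as ASCII, and what the errors argument of .encode() does about it. The variable names are made up for illustration.

```python
# Python 3 sketch: encoding a non-ASCII string with different codecs.
greek = "Greek alpha: α"

try:
    greek.encode("ascii")  # α has no ASCII representation
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)  # cannot encode: ordinal not in range(128)

replaced = greek.encode("ascii", errors="replace")
escaped = greek.encode("ascii", errors="backslashreplace")
utf8 = greek.encode("utf-8")

print(replaced)  # b'Greek alpha: ?'
print(escaped)   # b'Greek alpha: \\u03b1'
print(utf8)      # b'Greek alpha: \xce\xb1'
```

Only the UTF-8 version can be decoded back to the original text; the "replace" handler loses the character.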

Unicode in Regex

When you use regex on a Unicode string in Python 2 and want the patterns \w, \W, \b, \B to follow Unicode character properties, you need to add the Unicode flag re.U (same as re.UNICODE) when calling regex functions.

# -*- coding: utf-8 -*-
# python 2

import re

rr = re.search(r"\w+", u"♥αβγ!", re.U)

if rr:
    print rr.group().encode("utf8")
else:
    print "no match"

# prints 「αβγ」

# if re.U is not used, it prints 「no match」 because the 「\w+」 pattern for “word” only considers ASCII letters, digits, and underscore

See: Python Regex Flags.
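In Python 3 the situation is reversed: for str patterns, \w matches Unicode word characters by default, and the re.ASCII flag restricts it to ASCII. A minimal sketch, with made-up variable names:

```python
# Python 3 sketch: \w is Unicode-aware by default; re.ASCII turns that off.
import re

sample = "♥αβγ!"

unicode_match = re.search(r"\w+", sample)           # default: Unicode matching
ascii_match = re.search(r"\w+", sample, re.ASCII)   # ASCII-only matching

print(unicode_match.group())  # αβγ
print(ascii_match)            # None
```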

Unicode in Python 3

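In Python 3, every string (type str) is a Unicode string, and source files are read as UTF-8 by default. You do not need # -*- coding: utf-8 -*- (unless the file is saved in another encoding), nor the u"…" prefix on strings.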

Python 3 supports Unicode in variable & function names.

# -*- coding: utf-8 -*- ← optional, but still good to indicate encoding
# python 3

def ƒ(n):
    return n+1

α = 4
print(ƒ(α))                     # prints 5
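Not every Unicode character is allowed in a name. As a rough check, this Python 3 sketch uses str.isidentifier() and unicodedata.name() to show which of a few sample characters can be used as an identifier. The sample list and variable names are chosen only for illustration.

```python
# Python 3 sketch: which characters are valid identifiers?
import unicodedata

samples = ["ƒ", "α", "♥"]
valid = {}

for ch in samples:
    valid[ch] = ch.isidentifier()
    print(ch, unicodedata.name(ch), valid[ch])

# ƒ LATIN SMALL LETTER F WITH HOOK True
# α GREEK SMALL LETTER ALPHA True
# ♥ BLACK HEART SUIT False
```

Letters such as ƒ and α are accepted, while symbols such as ♥ are not.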