Python 2: Unicode Tutorial

By Xah Lee.

For Python 3, see Python: Unicode Tutorial 🐍

Python 2 Source Code Encoding

If your source code contains literal Unicode characters (any non-ASCII character, such as ♥), the first or second line of the file must be

# -*- coding: utf-8 -*-

See: Python: Source Code Encoding

Python 2, String Containing Unicode, Declare Unicode String

If your string contains literal Unicode characters, such as ♥ (U+2665: BLACK HEART SUIT), then you must prefix the string with u, e.g. u"I ♥ U".

The u prefix makes the string a unicode object. Without the u, the string is just a byte sequence (type str).

The r and u can be combined, like this: ur"I ♥ Python"

# -*- coding: utf-8 -*-
# python 2

import sys
print sys.version

# unicode string starts with u
aa = u"I ♥ U"

print aa.encode('utf-8')
# I ♥ U
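
The difference between a Unicode string and a byte sequence shows up when you count their lengths. The following is a minimal sketch, not part of the original example; the variable names text and data are made up for illustration, and the code is written so that it runs on Python 2.7 and also on Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: Unicode string vs. byte sequence (runs on Python 2.7 and Python 3)
from __future__ import print_function

text = u"I \u2665 U"          # Unicode string; \u2665 is the heart character
data = text.encode("utf-8")   # byte sequence holding the UTF-8 encoding

print(len(text))  # 5 (characters)
print(len(data))  # 7 (bytes; the heart takes 3 bytes in UTF-8)

# decoding the bytes gives back the original Unicode string
same = data.decode("utf-8") == text
print(same)  # True
```

The Unicode string counts characters, while the encoded form counts bytes, so the two lengths differ as soon as a non-ASCII character is present.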

Sometimes when you print Unicode strings, you may get an error like this:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 16: ordinal not in range(128).

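The solution is to encode the Unicode string into a byte string with the .encode() method before printing. (The .decode() method goes the other way, from a byte string to a Unicode string.)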

# -*- coding: utf-8 -*-
# python 2

myStr = u'α'

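# Bad. This raises UnicodeEncodeError when the output encoding is ASCII
# (for example, when the output is piped to a file or another program).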
print 'Greek alpha: ', myStr

# Good
print 'Greek alpha: ', myStr.encode('utf-8')
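
The error comes from an attempt to encode a non-ASCII character with the ASCII codec. Here is a small sketch that triggers the same failure explicitly and shows two ways around it. The variable names are invented for illustration, and the "replace" error handler is one option among several, not something the original example uses.

```python
# -*- coding: utf-8 -*-
# Sketch: why the ASCII codec fails, and two ways around it
# (runs on Python 2.7 and Python 3)
from __future__ import print_function

alpha = u"\u03b1"  # GREEK SMALL LETTER ALPHA

# Encoding to ASCII fails, because alpha is outside range(128)
try:
    alpha.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)  # True

# Option 1: encode to UTF-8, which can represent every Unicode character
utf8_bytes = alpha.encode("utf-8")

# Option 2: stay in ASCII, substituting "?" for anything not encodable
fallback = alpha.encode("ascii", "replace")

# .decode() reverses .encode()
round_trip = utf8_bytes.decode("utf-8")
print(round_trip == alpha)  # True
```

Encoding to UTF-8 loses nothing. The "replace" handler is lossy, so it only suits output where a question mark in place of the character is acceptable.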

Python 2: Unicode in Regex

When using regex on a Unicode string, if you want the word patterns {\w, \W} and the boundary patterns {\b, \B} to depend on Unicode character properties, add the Unicode flag re.U when calling regex functions.

# -*- coding: utf-8 -*-
# python 2

# example showing the difference of using re.U regex flag

import re

rr = re.findall(r"\w+", u"♥αβγ!", re.U)

if rr:
    print rr
else:
    print "no match"

# prints [u'\u03b1\u03b2\u03b3']

# if re.U is not used, it prints “no match”, because the \w+ pattern for “word” only considers ASCII letters, digits, and underscore

See: Python: Regex Flags.
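
The boundary pattern \b depends on the same flag. The following sketch compiles a pattern with re.U and extracts whole words from a string of Greek letters. The names word_re and words are made up for illustration, and the code runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: \b and \w with the re.U flag (runs on Python 2.7 and Python 3)
from __future__ import print_function
import re

# alpha beta gamma, space, heart, space, delta epsilon zeta, exclamation mark
text = u"\u03b1\u03b2\u03b3 \u2665 \u03b4\u03b5\u03b6!"

# With re.U, Greek letters count as word characters,
# so \b finds the edges of each Greek word.
word_re = re.compile(r"\b\w+\b", re.U)
words = word_re.findall(text)

print(len(words))  # 2
```

The heart and the exclamation mark are not word characters, so they are left out of the result.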

Find Replace Unicode Char in String

# -*- coding: utf-8 -*-
# python 2

# example of finding all non-ASCII chars in a string

import re

ss = u"i♥NY 😸"

# find all runs of non-ASCII chars (anything outside U+0000 to U+007F)
myResult = re.findall(u"[^\u0000-\u007f]+", ss)

if myResult:
    print myResult  # [u'\u2665', u'\U0001f638']
else:
    print "no match"
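
For the replace side, the same kind of character class can be passed to re.sub. This is a minimal sketch; the helper replace_non_ascii and the name NON_ASCII are hypothetical, introduced here only for illustration. It runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: replace every run of non-ASCII chars in a string
# (runs on Python 2.7 and Python 3)
from __future__ import print_function
import re

# any run of characters outside the ASCII range U+0000 to U+007F
NON_ASCII = re.compile(r"[^\x00-\x7f]+")

def replace_non_ascii(text, repl=u"?"):
    """Return text with each run of non-ASCII chars replaced by repl."""
    return NON_ASCII.sub(repl, text)

ss = u"i\u2665NY \U0001f638"  # same text as above, written with escapes

cleaned = replace_non_ascii(ss)        # heart and cat face each become "?"
removed = replace_non_ascii(ss, u"")   # non-ASCII chars dropped entirely

print(cleaned)  # i?NY ?
print(removed)
```

Because the pattern matches runs rather than single characters, several adjacent non-ASCII characters collapse into a single replacement.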
