βˆ‘ Math γ€–Curves β—† Surfaces β—† Wallpaper Groups β—† Gallery β—† Software β—† POV-Rayγ€—
Programing γ€–Linux β—† Perl Python β—† HTML β—† CSS β—† JavaScript β—† PHP β—† Java β—† Emacs β—† ⌨ β—† Unicode β™₯γ€—
Web Hosting by 1οΌ†1

Unicode in Python 🐍

Xah Lee, , …,

If you are not familiar with Unicode, first see: UNICODE Basics: What's Character Set, Character Encoding, UTF-8, and All That?

This pages covers Python 2.7. For Python 3, see bottom.

Unicode Characters in Source Code

If your source code contains Unicode characters, you must declare the file's encoding -*- coding: utf-8 -*- in the first line or second line. Like this:

#-*- coding: utf-8 -*-
# python 2
u"I β™₯ Cats"

The #-*- coding: utf-8 -*- declaration in the first line is a convention adopted from the text editor Emacs. γ€”β˜›Β Emacs and Unicode Tips〕 It tells any program reading the file that the file is encoded using a particular character set. Its purpose is similar to HTML's <meta http-equiv="content-type" content="text/html; charset=utf-8" />. (See: Character Sets and Encoding in HTML β—‡ UNICODE Basics.) This convention is also used by Ruby. γ€”β˜›Β Unicode in Ruby πŸ’Žγ€•

Text Processing with Unicode Strings

Strings that contain Unicode characters must start with u in front of the string. Example:

#-*- coding: utf-8 -*-
# python 2

aa = u"I β™₯ U"                 # unicode string, start with β€œu” or β€œur”

print(aa)                        # I β™₯ U

The u makes the string a Unicode datatype. Without the u, string is just byte sequence.

Sometimes when you print Unicode strings, you may get a error like this:

# UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 16: ordinal not in range(128).

The solution is to use the β€œ.encode()” or β€œ.decode()” method. Example:

#-*- coding: utf-8 -*-
# python 2

myStr = u'Ξ±'

# # Bad. This is a error.
print 'Greek alpha: ', myStr

# Good
print 'Greek alpha: ', myStr.encode('utf-8')

Unicode in Regex

When using regex on Unicode string, and you want the pattern characters {\w, \W, \b, \B} dependent on the Unicode character properties , you need to add the Unicode flag re.U when calling regex functions.

# -*- coding: utf-8 -*-
# python

import re

rr = re.search(r'\w+', u'β™₯Ξ±Ξ²Ξ³!', re.U)

if rr:
    print rr.group().encode('utf8')
else:
    print "no match"
# prints γ€ŒΞ±Ξ²Ξ³γ€, but if re.U is not used, it prints γ€Œno match」 because the γ€Œ\w+」 pattern for β€œword” only consider ASCII letters

See: Python Regex Flags.

Unicode in Python 3

In Python 3, everything is Unicode (utf-8). You do not need to use # -*- coding: utf-8 -*-, nor need to have β€œu” in string.

Python 3 supports Unicode in variable and function names. Example:

# -*- coding: utf-8 -*-          ← optional, but still good to have
# python 3

def Ζ’(n):
    return n+1

Ξ± = 4
print(Ζ’(Ξ±)) # prints 5
blog comments powered by Disqus