If you are not familiar with Unicode, first see: UNICODE Basics: What's Character Set, Character Encoding, UTF-8, and All That?
This page covers Python 2.7. For Python 3, see the bottom of the page.
If your source code contains Unicode characters, you must declare the file's encoding with the comment # -*- coding: utf-8 -*- on the first or second line. Like this:
# -*- coding: utf-8 -*-
# python 2
u"I ♥ Cats"
The # -*- coding: utf-8 -*- declaration in the first line is a convention adopted from the text editor Emacs. (See: Emacs and Unicode Tips.) It tells any program reading the file which character encoding the file uses. Its purpose is similar to HTML's
<meta http-equiv="content-type" content="text/html; charset=utf-8" />.
(See: Character Sets and Encoding in HTML; UNICODE Basics.)
This convention is also used by Ruby. (See: Unicode in Ruby.)
String literals that contain Unicode characters must be prefixed with u. Example:
# -*- coding: utf-8 -*-
# python 2
aa = u"I ♥ U"  # unicode string, start with “u” or “ur”
print(aa)  # I ♥ U
The u prefix makes the string a value of type unicode. Without the u, the string is just a byte sequence (type str).
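The difference between a Unicode string and a byte sequence is easiest to see by counting. The following is a minimal sketch, not part of the original examples: the variable names are made up for illustration, and the character is written as the escape \u03b1 (Greek alpha) so that the snippet behaves the same under Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: the variable names are hypothetical.

text = u"\u03b1"             # Unicode string holding one character, Greek alpha
data = text.encode("utf-8")  # the same character as a UTF-8 byte sequence

print(len(text))  # 1, counted in characters
print(len(data))  # 2, counted in bytes: alpha takes two bytes in UTF-8

# Decoding the bytes gives back the original Unicode string.
print(data.decode("utf-8") == text)  # True
```

The same text has different lengths depending on whether it is held as characters or as encoded bytes, which is why the two types should not be mixed casually.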
Sometimes when you print Unicode strings, you may get an error like this:
# UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 16: ordinal not in range(128).
The solution is to convert explicitly with the “.encode()” method (Unicode string to bytes) or the “.decode()” method (bytes to Unicode string). Example:
# -*- coding: utf-8 -*-
# python 2
myStr = u'α'
# Bad. This can raise UnicodeEncodeError, for example when the output encoding is ASCII.
print 'Greek alpha: ', myStr
# Good
print 'Greek alpha: ', myStr.encode('utf-8')
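The error comes from an implicit conversion to ASCII. Here is a rough sketch that triggers the same error on purpose and then shows some explicit alternatives. The variable names are illustrative only, and the snippet is written so that it should run under both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: the variable names are hypothetical.

alpha = u"\u03b1"  # Greek alpha, written as an escape

# Encoding to ASCII fails, because alpha is outside the range 0 to 127.
try:
    alpha.encode("ascii")
    failed = False
except UnicodeEncodeError:
    failed = True
print(failed)  # True

# Encoding to UTF-8 succeeds, since UTF-8 can represent this character.
utf8_bytes = alpha.encode("utf-8")

# If ASCII output is really required, say what to do with other characters.
replaced = alpha.encode("ascii", "replace")          # question mark placeholder
escaped = alpha.encode("ascii", "backslashreplace")  # backslash escape text
print(len(utf8_bytes), len(replaced), len(escaped))  # 2 1 6
```

Passing an error handler such as "replace" loses information, so encoding to UTF-8 is usually the better choice when the receiving end can handle it.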
When using regex on a Unicode string, if you want the pattern characters {\w, \W, \b, \B, \d, \D, \s, \S} to depend on Unicode character properties, you need to pass the Unicode flag re.U when calling regex functions.
# -*- coding: utf-8 -*-
# python 2
import re
rr = re.search(r'\w+', u'♥αβγ!', re.U)
if rr:
    print rr.group().encode('utf8')
else:
    print "no match"
# prints 「αβγ」, but if re.U is not used, it prints 「no match」
# because the 「\w+」 pattern for “word” only considers ASCII letters
See: Python Regex Flags.
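As a further sketch of the Unicode flag, the snippet below uses re.findall with escaped characters so that it should run under both Python 2.7 and Python 3. The variable names are made up for illustration. Note that in Python 3, re.U is already the default for string patterns, so there the flag is redundant but harmless.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: the variable names are hypothetical.
import re

# heart, alpha, beta, gamma, exclamation mark, written as escapes
sample = u"\u2665\u03b1\u03b2\u03b3!"

# With re.U, \w matches the Greek letters but not the heart symbol or "!".
words = re.findall(r"\w+", sample, re.U)
print(words == [u"\u03b1\u03b2\u03b3"])  # True

# With re.U, \d also matches non-ASCII decimal digits,
# such as U+0663 ARABIC-INDIC DIGIT THREE.
digits = re.findall(r"\d", u"3\u0663", re.U)
print(len(digits))  # 2
```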
In Python 3, strings are Unicode and the default source file encoding is UTF-8. You do not need the # -*- coding: utf-8 -*- declaration, nor the “u” prefix on strings.
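A small Python 3 sketch of what this means in practice. The variable names are illustrative; the u prefix is accepted again from Python 3.3 onward, purely for compatibility with Python 2 code.

```python
# python 3
# Illustrative sketch: the variable names are hypothetical.

s = "I ♥ U"  # a plain string literal is already Unicode text (type str)

# The u prefix is allowed (Python 3.3+) but changes nothing.
same_type = type(u"I ♥ U") is str
print(same_type)  # True

# Encoded data has its own type, bytes, and does not mix with str.
b = s.encode("utf-8")
print(type(b) is bytes)  # True

try:
    s + b
    mixed = True
except TypeError:
    mixed = False
print(mixed)  # False: str and bytes must be converted explicitly
```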
Python 3 supports Unicode in variable and function names. Example:
# -*- coding: utf-8 -*-  (optional, but still good to have)
# python 3
def ƒ(n):
    return n+1
α = 4
print(ƒ(α))  # prints 5
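Not every Unicode character is allowed in a name. One way to check, sketched here with made-up candidate names, is the string method isidentifier(): letters such as ƒ and α qualify, while symbols such as ♥ do not, and a name still cannot start with a digit.

```python
# python 3
# Illustrative sketch: the candidate names are hypothetical.

candidates = ["ƒ", "α", "naïve_count", "♥", "2α"]

# Map each candidate to whether Python accepts it as an identifier.
valid = {name: name.isidentifier() for name in candidates}

print(valid["ƒ"])            # True
print(valid["α"])            # True
print(valid["naïve_count"])  # True
print(valid["♥"])            # False: a symbol, not a letter
print(valid["2α"])           # False: starts with a digit
```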