Python: Unicode Tutorial 🐍

By Xah Lee.

Unicode Characters in Source Code

What is Python source code's default encoding?

For Python 2.x, it's ASCII. [see ASCII Table]

For Python 3.x, it's UTF-8.

Python 2: If your source code contains non-ASCII (Unicode) characters, you must declare the file's encoding on the first or second line, like this:

# -*- coding: utf-8 -*-

#-*- coding: utf-8 -*-
# python 2

x = u"i ♥ cats"

print x

The #-*- coding: utf-8 -*- declaration in the first line is a convention adopted from the text editor Emacs. [see Emacs and Unicode Tips] It tells any program reading the file which encoding the file uses. [see HTML: Character Sets and Encoding] This convention is also used by Ruby. [see Ruby: Unicode Tutorial 💎]
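
To see how a program can pick up this declaration, here is a minimal sketch in Python 3 using the standard library function tokenize.detect_encoding. The helper name source_encoding and the sample byte strings are made up for illustration.

```python
# python 3
import io
import tokenize

def source_encoding(source_bytes):
    """Return the encoding name Python would use to decode this source."""
    # detect_encoding looks for a BOM or a coding declaration
    # in the first two lines of the file
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

# hypothetical file contents: declaration on the second line
declared = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nx = 1\n"
# hypothetical file contents: no declaration at all
undeclared = b"x = 1\n"

print(source_encoding(declared))    # iso-8859-1
print(source_encoding(undeclared))  # utf-8
```

Note that the declared name is normalized (latin-1 becomes iso-8859-1), and that a file with no declaration falls back to the Python 3 default, UTF-8.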

For a basic introduction to Unicode, see: Unicode Basics: Character Set, Encoding, UTF-8

Python 2: Declare Unicode String

For Python 2, a string literal that contains Unicode characters must be prefixed with u.

#-*- coding: utf-8 -*-
# python 2

aa = u"I ♥ U"  # unicode string starts with “u”

print aa # I ♥ U

The u makes the string a Unicode datatype. Without the u, the string is just a byte sequence.
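
The difference between a Unicode string and a byte sequence can be seen by comparing lengths. The following is a small sketch in Python 3, where the two types are named str and bytes (in Python 2 they are unicode and str). The variable names are only for illustration.

```python
# python 3
text = "I ♥ U"                # str: a sequence of Unicode characters
data = text.encode("utf-8")   # bytes: the same text as a UTF-8 byte sequence

print(type(text).__name__, len(text))  # str 5
print(type(data).__name__, len(data))  # bytes 7 (the heart takes 3 bytes)
print(data.decode("utf-8") == text)    # True
```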

Sometimes when you print Unicode strings, you may get an error like this:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 16: ordinal not in range(128).

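The solution is to convert explicitly: use the .encode() method to turn a Unicode string into bytes before printing, or the .decode() method to turn bytes into a Unicode string.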

#-*- coding: utf-8 -*-
# python 2

myStr = u'α'

# Bad. This can raise UnicodeEncodeError when the output encoding is ASCII.
print 'Greek alpha: ', myStr

# Good
print 'Greek alpha: ', myStr.encode('utf-8')
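
The same error can be reproduced on purpose, which makes it easier to see what .encode() does. This is a sketch in Python 3, not the Python 2 print behavior itself. The errors="backslashreplace" option is a standard argument of str.encode, shown here as one possible fallback when the target encoding cannot represent a character.

```python
# python 3
text = "Greek alpha: α"

try:
    text.encode("ascii")        # ASCII has no code for α
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

utf8_bytes = text.encode("utf-8")
escaped = text.encode("ascii", errors="backslashreplace")

print(ascii_ok)     # False
print(utf8_bytes)   # b'Greek alpha: \xce\xb1'
print(escaped)      # b'Greek alpha: \\u03b1'
```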

Unicode Escape Sequence in String

You can have a unicode character literally in a string, such as u'i ♥ u'.

You can also use escape sequences. There are 2 forms: \uxxxx with exactly 4 hexadecimal digits, and \Uxxxxxxxx with exactly 8 hexadecimal digits.

# -*- coding: utf-8 -*-
# python 2

x = u"♥" # BLACK HEART SUIT, hex 2665
y = u"\u2665"

print x == y # True

# -*- coding: utf-8 -*-
# python 2

# unicode escape sequence, for char with more than 4 hex digits

x = u"😸"  # name: GRINNING CAT FACE WITH SMILING EYES, hex 1f638
y = u"\U0001f638"

print x == y # True
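
For comparison, here is a sketch of the same escapes in Python 3, where no u prefix is needed. It also shows \N{name}, a third escape form that Python supports but the text above does not cover, plus the built-in functions ord and chr for converting between a character and its codepoint.

```python
# python 3
heart = "\u2665"                  # 4 hex digits
cat = "\U0001f638"                # 8 hex digits
named = "\N{BLACK HEART SUIT}"    # by Unicode character name

print(heart == named)          # True
print(hex(ord(cat)))           # 0x1f638
print(chr(0x2665) == heart)    # True
print(len(cat))                # 1
```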

Unicode in Regex

When using regex on a Unicode string, if you want the word patterns {\w, \W} and the boundary patterns {\b, \B} to depend on Unicode character properties, add the Unicode flag re.U when calling regex functions.

# -*- coding: utf-8 -*-
# python 2

# example showing the difference of using re.U regex flag

import re

rr = re.findall(r"\w+", u"♥αβγ!", re.U)

if rr:
    print rr
else:
    print "no match"

# prints [u'\u03b1\u03b2\u03b3']

# if re.U is not used, it prints 「no match」 because the 「\w+」 pattern for “word” only considers ASCII letters, digits, and underscore
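
In Python 3 the default is reversed: patterns given as str are Unicode-aware without any flag, and the re.ASCII flag restricts \w to ASCII. A minimal sketch of that difference, with variable names chosen for illustration:

```python
# python 3
import re

text = "♥αβγ!"

# str patterns are Unicode-aware by default
unicode_words = re.findall(r"\w+", text)
# re.ASCII restricts \w to [a-zA-Z0-9_]
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)  # ['αβγ']
print(ascii_words)    # []
```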

See: Python Regex Flags.

Find/Replace Unicode Char in String

# -*- coding: utf-8 -*-
# python 2

# example of finding all non-ASCII chars in a string

import re

ss = u"i♥NY 😸"

# find all non-ASCII chars (anything outside the range U+0000 to U+007F)
myResult = re.findall(u"[^\u0000-\u007f]+", ss)

if myResult:
    print myResult  # [u'\u2665', u'\U0001f638']
else:
    print "no match"
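
The example above only finds characters. For the replace half, here is a sketch in Python 3 using re.sub with the same kind of character class. The names NON_ASCII and show_codepoint are made up for this example, and the U+XXXX output format is just one possible replacement.

```python
# python 3
import re

# any single character outside the ASCII range
NON_ASCII = re.compile(r"[^\x00-\x7f]")

def show_codepoint(match):
    """Replace a matched character with its codepoint in U+XXXX form."""
    return "U+%04X" % ord(match.group())

text = "i♥NY 😸"

print(NON_ASCII.findall(text))              # ['♥', '😸']
print(NON_ASCII.sub("?", text))             # i?NY ?
print(NON_ASCII.sub(show_codepoint, text))  # iU+2665NY U+1F638
```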

Find Unicode Character's Name, Codepoint, Properties

See: Python: Get Unicode Name, Codepoint.

Convert File's Encoding

See: Python: Convert File Encoding.

Unicode in Python 3

In Python 3, every string (type str) is Unicode, and the default source code encoding is UTF-8. You do not need # -*- coding: utf-8 -*-, nor u"…".
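
A quick way to check this on your own system is sketched below. It assumes Python 3.3 or later, where the u prefix is accepted again for compatibility with Python 2 code but changes nothing.

```python
# python 3
import sys

default_encoding = sys.getdefaultencoding()

print(default_encoding)        # utf-8
print(type("♥").__name__)      # str
print(u"♥" == "♥")             # True
```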

Python 3 supports Unicode in variable and function names.

# -*- coding: utf-8 -*-
# python 3

# the first line is optional, but still good to indicate encoding

def ƒ(n):
    return n+1

α = 4
print(ƒ(α)) # prints 5

Note: Unicode characters that are not letters (for example, symbols such as ♥) are not allowed in names.

# python 3

♥ = 4

#     ♥ = 4
#     ^
# SyntaxError: invalid character in identifier
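
You can test whether a name is valid without triggering a SyntaxError by using the built-in string method isidentifier. A minimal sketch, with a made-up list of candidate names:

```python
# python 3
candidates = ["α", "ƒ", "♥", "2α"]

# isidentifier() is True only if the string is a valid Python 3 name
results = {name: name.isidentifier() for name in candidates}

print(results)  # {'α': True, 'ƒ': True, '♥': False, '2α': False}
```

The last entry shows that a digit cannot start a name, even when the rest of the name is a letter.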

