Python: Unicode Tutorial 🐍

By Xah Lee.

Python Source Code Encoding

What's Python source code's default encoding?

For Python 3.x, it's UTF-8.

For Python 2.x, it's ASCII. [see ASCII Table]

Python 2: If your source code contains non-ASCII characters, you must declare the file's encoding in the first or second line, like this:

# -*- coding: utf-8 -*-

# -*- coding: utf-8 -*-
# python 2

x = u"i ♥ cats"

print x

The # -*- coding: utf-8 -*- declaration in the first line is a convention adopted from the text editor Emacs. It tells any program reading the file which encoding the file uses.
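To see how a program can read this declaration, the standard library function tokenize.detect_encoding can be given the first bytes of a source file. This is a minimal Python 3 sketch. The helper name declared_encoding and the sample byte strings are made up for illustration.

```python
# python 3
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python would use to decode this source (hypothetical helper)."""
    # detect_encoding reads at most two lines, looking for a BOM or a coding declaration
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

first_line = b"# -*- coding: utf-8 -*-\nx = 1\n"
second_line = b"#!/usr/bin/env python\n# -*- coding: utf-8 -*-\nx = 1\n"
no_declaration = b"x = 1\n"

print(declared_encoding(first_line))      # utf-8
print(declared_encoding(second_line))     # utf-8
print(declared_encoding(no_declaration))  # utf-8, the Python 3 default
```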

If you don't know unicode, read this first: Unicode Basics: Character Set, Encoding, UTF-8, Codepoint


Unicode in Python 3

Python 3's string is a sequence of unicode characters.

In Python 3, source code is UTF-8 by default and every string is Unicode. You do not need # -*- coding: utf-8 -*-, nor u"…".
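A small sketch of what "sequence of Unicode characters" means in practice: len() and indexing count characters, not bytes. The sample string is only an illustration.

```python
# python 3
s = "i ♥ cats"

print(len(s))                  # 8 characters
print(len(s.encode("utf-8")))  # 10 bytes, because ♥ takes 3 bytes in UTF-8
print(s[2])                    # ♥
print(type(s).__name__)        # str
```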

Python 3 supports Unicode in variable and function names.

# python 3

def ƒ(n):
    return n+1

α = 4
print(ƒ(α))
# 5

Note: Unicode characters that are not letters, such as symbols, are not allowed in names.

# python 3

♥ = 4

print(♥)

#     ♥ = 4
#     ^
# SyntaxError: invalid character in identifier
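One way to check a name before using it is the string method isidentifier(), together with unicodedata.category() to see what kind of character it starts with. This is a minimal sketch. The helper name check and the sample names are chosen for illustration.

```python
# python 3
import unicodedata

def check(name):
    """Return (is a valid identifier, Unicode category of the first char)."""
    return (name.isidentifier(), unicodedata.category(name[0]))

for name in ["ƒ", "α", "♥", "x2", "2x"]:
    print(name, check(name))

# ƒ and α are letters (category Ll), so they are valid names.
# ♥ is a symbol (category So), so it is not.
```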

Python 2: Declare Unicode String

For Python 2, strings that contain Unicode characters must be prefixed with u.

For Python 3 (3.3 and later), a string literal can begin with u, for example u"xyz", but the prefix has no effect. Every string is already a Unicode datatype.

# -*- coding: utf-8 -*-
# python 2

aa = u"I ♥ U"  # unicode string starts with “u”

print aa # I ♥ U

The u makes the string a Unicode datatype. Without the u, the string is just a byte sequence.

Sometimes when you print Unicode strings, you may get an error like this:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 16: ordinal not in range(128).

The solution is to convert explicitly, using the .encode() method (Unicode string to bytes) or the .decode() method (bytes to Unicode string).

# -*- coding: utf-8 -*-
# python 2

myStr = u'α'

# Bad. This is an error when the output encoding is ASCII.
print 'Greek alpha: ', myStr

# Good
print 'Greek alpha: ', myStr.encode('utf-8')
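The same conversion can be sketched in Python 3, where .encode() turns a str into bytes and .decode() turns bytes back into a str. The variable names and the "backslashreplace" fallback are choices made for this illustration.

```python
# python 3
text = "Greek alpha: α"

data = text.encode("utf-8")          # str -> bytes
print(data)                          # b'Greek alpha: \xce\xb1'
print(data.decode("utf-8") == text)  # True

# Encoding to ASCII fails, because α is outside the ASCII range
try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    print("cannot encode as ASCII:", err.reason)  # ordinal not in range(128)

# An error handler gives a fallback instead of an exception
print(text.encode("ascii", errors="backslashreplace"))  # b'Greek alpha: \\u03b1'
```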

Unicode Escape Sequence in String

Both Python 3 and Python 2 can have Unicode characters literally in a string. For example, "i ♥ u" or u"i ♥ u".

You can also use escape sequences. There are 2 forms: \u followed by 4 hex digits, and \U followed by 8 hex digits.

# python 3

# BLACK HEART SUIT, hex 2665
x = "♥"
y = "\u2665"

print(x == y)
# True

# unicode escape sequence, for char with more than 4 hex digits

# GRINNING CAT FACE WITH SMILING EYES, hex 1f638
x = "😸"
y = "\U0001f638"

print(x == y)
# True

# -*- coding: utf-8 -*-
# python 2

# BLACK HEART SUIT, hex 2665
x = u"♥"
y = u"\u2665"

print x == y
# True

# unicode escape sequence, for char with more than 4 hex digits

# GRINNING CAT FACE WITH SMILING EYES, hex 1f638
x = u"😸"
y = u"\U0001f638"

print x == y
# True
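Going the other way, from a character to its escape sequence, can be sketched with ord(). The helper to_escape is hypothetical, written for this illustration. It picks the 4-digit form when the codepoint fits and the 8-digit form otherwise.

```python
# python 3
def to_escape(ch):
    """Return the escape sequence for one character (hypothetical helper)."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        return "\\u%04x" % cp   # 4 hex digits
    return "\\U%08x" % cp       # 8 hex digits

for ch in ["♥", "😸"]:
    print(ch, hex(ord(ch)), to_escape(ch))

# ♥ 0x2665 \u2665
# 😸 0x1f638 \U0001f638
```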

Python 2: Unicode in Regex

When you use regex on a Unicode string and want the word patterns {\w, \W} and boundary patterns {\b, \B} to follow Unicode character properties, you need to add the Unicode flag re.U when calling regex functions.

# -*- coding: utf-8 -*-
# python 2

# example showing the difference of using re.U regex flag

import re

rr = re.findall(r"\w+", u"♥αβγ!", re.U)

if rr:
    print rr
else:
    print "no match"

# prints [u'\u03b1\u03b2\u03b3']

# if re.U is not used, it prints 「no match」 because the 「\w+」 pattern for “word” only considers ASCII letters

See: Python Regex Flags.
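For comparison, here is a short Python 3 sketch. In Python 3, patterns on str are Unicode-aware by default, and the re.ASCII flag gives the ASCII-only behavior. The variable names are chosen for illustration.

```python
# python 3
import re

text = "♥αβγ!"

unicode_words = re.findall(r"\w+", text)          # Unicode matching is the default
ascii_words = re.findall(r"\w+", text, re.ASCII)  # restrict \w to ASCII

print(unicode_words)  # ['αβγ']
print(ascii_words)    # []
```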

Find/Replace Unicode Char in String

# -*- coding: utf-8 -*-
# python 2

# example of finding all non-ASCII chars in a string

import re

ss = u"i♥NY 😸"

# find all non-ASCII chars (anything outside U+0000 to U+007F)
myResult = re.findall(u"[^\u0000-\u007f]+", ss)

if myResult:
    print myResult  # [u'\u2665', u'\U0001f638']
else:
    print "no match"
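The replace half can be sketched in Python 3 with re.sub, using the same idea of matching anything outside the ASCII range. The replacement text "?" and the U+XXXX format are arbitrary choices for this illustration.

```python
# python 3
import re

ss = "i♥NY 😸"
non_ascii = re.compile(r"[^\x00-\x7f]")

found = non_ascii.findall(ss)
masked = non_ascii.sub("?", ss)
# replace each char by its codepoint, using a function as the replacement
named = non_ascii.sub(lambda m: "U+%04X" % ord(m.group()), ss)

print(found)   # ['♥', '😸']
print(masked)  # i?NY ?
print(named)   # iU+2665NY U+1F638
```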

Find Unicode Character's Name, Codepoint, Properties

Python: Get Unicode Name, Codepoint

Convert File's Encoding

Python: Convert File Encoding

Unicode Characters Search

Unicode Characters ∑ ♥ 😄

[see Unicode Basics: Character Set, Encoding, UTF-8]
