Ruby: Unicode Tutorial 💎

By Xah Lee.

Ruby has robust support for Unicode, starting with version 1.9. This page is about Ruby 1.9 or later.

Source Code Encoding and Default Encoding for String

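Start your file with the comment # -*- coding: utf-8 -*- on the first line (or on the second line, if the first is a shebang line). This sets the source code's encoding to UTF-8. Such a comment is called a magic comment. Since Ruby 2.0, UTF-8 is the default source encoding, so the comment is needed only for Ruby 1.9 or when you want a different encoding.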

Any of the following forms also works.

# -*- coding: utf-8 -*-  # also the Emacs and Python convention
# -*- coding: UTF-8 -*-

# coding: utf-8
# coding: UTF-8

# encoding: utf-8
# encoding: UTF-8

p __ENCODING__                  # prints #<Encoding:UTF-8>
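To see the effect of the magic comment, you can compare __ENCODING__ with the encoding of a string literal written in the same file. This is a minimal sketch; the variable names are made up for illustration.

```ruby
# -*- coding: utf-8 -*-

source_enc  = __ENCODING__     # encoding of this source file
literal_enc = "♥".encoding     # encoding of a literal written in this file

p source_enc
p literal_enc
p source_enc == literal_enc    # a literal takes the source file's encoding
```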

Ruby String = Bytes + Encoding Info

In Ruby, each string is an object that carries info about its encoding. You can use the method encoding to find a string's encoding.

# -*- coding: utf-8 -*-
# ruby

p "abc♥".encoding                # ⇒ #<Encoding:UTF-8>

p "abc♥".encoding.name           # ⇒ "UTF-8"

p "abc♥".size                    # ⇒ 4

p "abc♥".bytesize                # ⇒ 6
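The "bytes + encoding info" idea can be made visible by looking at the same string as characters, as code points, and as raw bytes. A small sketch, assuming the source file is UTF-8:

```ruby
# -*- coding: utf-8 -*-

s = "abc♥"

p s.chars        # ⇒ ["a", "b", "c", "♥"]
p s.codepoints   # ⇒ [97, 98, 99, 9829]
p s.bytes        # ⇒ [97, 98, 99, 226, 153, 165]
p s.encoding     # ⇒ #<Encoding:UTF-8>
```

The heart takes 3 bytes in UTF-8, which is why size is 4 but bytesize is 6.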

Change a String's Encoding info

Use the method force_encoding to change a string's encoding info. This doesn't convert the bytes of the string; it only changes the encoding metadata attached to them.

# -*- coding: utf-8 -*-
# ruby

ss = "α"
p ss                             # ⇒ "α"

ss.force_encoding("GB18030")     # relabel the same bytes as GB18030
p ss                             # ⇒ "\x{CEB1}"
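To check that force_encoding only relabels, you can compare the bytes before and after. The sketch below uses ASCII-8BIT (binary) as the temporary label; that choice is just for illustration.

```ruby
# -*- coding: utf-8 -*-

s = "α".dup                       # dup so this also works if literals are frozen
before = s.bytes                  # the two UTF-8 bytes, CE B1

s.force_encoding("ASCII-8BIT")    # relabel only; no bytes are converted
p s.bytes == before               # ⇒ true
p s.size                          # ⇒ 2, each byte now counts as one character

s.force_encoding("UTF-8")         # put the original label back
p s == "α"                        # ⇒ true

# force_encoding does not validate; a wrong label gives an invalid string
bad = [0xFF].pack("C").force_encoding("UTF-8")
p bad.valid_encoding?             # ⇒ false
```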

Convert Encoding for a String

Use the method encode! to convert a string's encoding in place.

Use the method encode to get a converted copy, leaving the original string unmodified.

# -*- coding: utf-8 -*-
# ruby

ss = "α"
p ss.encoding.name                # ⇒ "UTF-8"

ss.encode!("GB18030")             # convert in place
p ss.encoding.name                # ⇒ "GB18030"

p ss.encode("utf-8")              # ⇒ "α"
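Unlike force_encoding, encode produces different bytes for the same character. Here is a sketch of a round trip, plus what happens when the target encoding has no such character. The undef: and replace: options are standard String#encode options; the variable names are illustrative.

```ruby
# -*- coding: utf-8 -*-

utf8 = "α"
gb   = utf8.encode("GB18030")        # new string; utf8 is left untouched

p gb.encoding                        # ⇒ #<Encoding:GB18030>
p gb.bytes == utf8.bytes             # ⇒ false, the bytes were converted
round_trip = gb.encode("UTF-8")
p round_trip == utf8                 # ⇒ true

# ♥ does not exist in ISO-8859-1, so a plain encode raises an error
begin
  "♥".encode("ISO-8859-1")
rescue Encoding::UndefinedConversionError => e
  puts "cannot convert: #{e.message}"
end

# ask for a replacement character instead of an error
fallback = "♥".encode("ISO-8859-1", undef: :replace, replace: "?")
p fallback.bytes                     # ⇒ [63], the byte for "?"
```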

Default Encoding of Strings

A string literal takes its encoding from the source code's encoding.

When you open a file, you can specify its external encoding like this:

# -*- coding: utf-8 -*-
# ruby

ff = File.open("file.txt", "r:UTF-8") # open a file for reading, specifying an encoding

ff.each {|xx| p xx }            # print each line
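The following is a self-contained sketch of the same idea. It first creates a temporary sample file (a hypothetical stand-in for file.txt), then reads it back with "r:UTF-8" and records each line's encoding. It assumes Encoding.default_internal is not set, which is the usual case.

```ruby
# -*- coding: utf-8 -*-
require "tempfile"

lines = []

Tempfile.create("sample") do |tmp|
  tmp.write("abc♥\n")               # write some UTF-8 text
  tmp.close

  # the block form closes the file automatically
  File.open(tmp.path, "r:UTF-8") do |ff|
    ff.each_line { |line| lines << line }
  end
end

lines.each { |line| p [line, line.encoding] }
```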

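Ruby has the concepts of external encoding (the encoding of the data in the file) and internal encoding (the encoding of the Ruby strings produced when the data is read). You can give both in the mode string, in the form "r:external:internal".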


# -*- coding: utf-8 -*-
# ruby

# Read a file.
# Tell Ruby the encoding of the file (external encoding)
# and the encoding to use when the content becomes a Ruby string (internal encoding).

ff = File.open("file.txt", "r:UTF-8:UTF-16")

p ff.external_encoding     # UTF-8
p ff.internal_encoding     # UTF-16

# read a line, save to ss
ss = ff.readline

# print ss's encoding
p ss.encoding              # ⇒ UTF-16
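Here is a self-contained sketch of external versus internal encoding. It uses a different pair of encodings than the example above: a temporary file holding ISO-8859-1 bytes is read back as UTF-8 strings. The file and variable names are made up for illustration.

```ruby
# -*- coding: utf-8 -*-
require "tempfile"

ext = int = line = nil

Tempfile.create("latin1") do |tmp|
  tmp.close
  # store "café" as ISO-8859-1 bytes (é is the single byte E9)
  File.binwrite(tmp.path, "café\n".encode("ISO-8859-1"))

  # external encoding: ISO-8859-1 (what is in the file)
  # internal encoding: UTF-8 (what the Ruby strings should be)
  File.open(tmp.path, "r:ISO-8859-1:UTF-8") do |ff|
    ext  = ff.external_encoding
    int  = ff.internal_encoding
    line = ff.readline             # transcoded while reading
  end
end

p ext             # ⇒ #<Encoding:ISO-8859-1>
p int             # ⇒ #<Encoding:UTF-8>
p line            # ⇒ "café\n"
p line.encoding   # ⇒ #<Encoding:UTF-8>
```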

When writing to a file, you can also specify an encoding, like this: open("output.txt", "w:GB18030").
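As a quick check, you can write a UTF-8 string through a "w:GB18030" file and then read the raw bytes back. The sketch below uses a temporary file in place of output.txt.

```ruby
# -*- coding: utf-8 -*-
require "tempfile"

written = nil

Tempfile.create("output") do |tmp|
  tmp.close

  # the UTF-8 string is converted to GB18030 as it is written
  File.open(tmp.path, "w:GB18030") { |out| out.write("α") }

  written = File.binread(tmp.path)   # raw bytes, no decoding
end

p written.bytes == "α".encode("GB18030").bytes   # ⇒ true
p written.bytes == "α".bytes                     # ⇒ false, not the UTF-8 bytes
```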

The Encoding Class; Supported Encodings

You can print all supported encodings like this:

# -*- coding: utf-8 -*-
# ruby

Encoding.list.each { |xx| p xx }
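Besides listing everything, the Encoding class can look up an encoding by name and show its aliases. A short sketch; the exact list of names and the default external encoding depend on your Ruby build and environment, so no output is shown for those lines.

```ruby
# -*- coding: utf-8 -*-

utf8 = Encoding.find("utf-8")          # lookup is case-insensitive
p utf8                                 # ⇒ #<Encoding:UTF-8>
p utf8 == Encoding::UTF_8              # ⇒ true

p utf8.names                           # the name plus its aliases

names = Encoding.name_list             # all names and aliases, as strings
p names.include?("GB18030")            # ⇒ true

p Encoding.default_external            # encoding assumed for files by default
```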

If you are not familiar with Unicode, see: Unicode Basics: What's Character Set, Character Encoding, UTF-8?.