Ruby: Unicode Tutorial 💎

By Xah Lee.

Ruby has robust support for Unicode, starting with version 1.9. This page is about Ruby 1.9 or later.

Source Code Encoding and Default Encoding for String

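Start your file with # -*- coding: utf-8 -*- on the first line (or on the second line, if the first line is a #! line). This sets UTF-8 as the source code's encoding. This line is called a magic comment. Since Ruby 2.0, UTF-8 is already the default source encoding, so the magic comment is required only on Ruby 1.9 or when the file uses a different encoding.

Any of the following forms also works.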

# -*- coding: utf-8 -*-  # also the Emacs and Python convention
# -*- coding: UTF-8 -*-

# coding: utf-8
# coding: UTF-8

# encoding: utf-8
# encoding: UTF-8

p __ENCODING__ # prints #<Encoding:UTF-8>

Ruby String = Bytes + Encoding Info

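In Ruby, each string is an object that carries its bytes together with info about their encoding. Use the method encoding to find a string's encoding. The method size counts characters; the method bytesize counts bytes.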

# -*- coding: utf-8 -*-
# ruby

p "abc♥".encoding # #<Encoding:UTF-8>

p "abc♥".encoding.name # "UTF-8"

p "abc♥".size # 4

p "abc♥".bytesize # 6
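To see where the difference between size and bytesize comes from, it can help to look at the same string as characters, code points, and bytes. The following is a small sketch, assuming Ruby 2.0 or later (where chars, codepoints, and bytes return arrays); the variable names are only for illustration.

```ruby
# -*- coding: utf-8 -*-
# One string, three views: characters, code points, bytes.

heart = "abc♥"

chars      = heart.chars       # one element per character
codepoints = heart.codepoints  # Unicode code point of each character
bytes      = heart.bytes       # raw bytes as stored (UTF-8 here)

p chars.size        # 4
p codepoints        # [97, 98, 99, 9829]
p bytes             # [97, 98, 99, 226, 153, 165]

# The heart alone takes 3 bytes in UTF-8, which is why bytesize is 6.
p heart[-1].bytesize # 3
```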

Change a String's Encoding Info

Use the method force_encoding to change a string's encoding info. This does not convert the string's bytes; it only changes the encoding label attached to them, so the same bytes are then interpreted differently.

# -*- coding: utf-8 -*-
# ruby

ss = "α"
p ss # "α"

ss.force_encoding("GB18030")
p ss # "\x{CEB1}"
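A common practical use of force_encoding is to label bytes that arrived without encoding info, for example from a socket or a binary read. The sketch below assumes Ruby 2.0 or later (for String#b) and uses made-up variable names; it also shows valid_encoding? as a way to check whether the new label actually fits the bytes.

```ruby
# -*- coding: utf-8 -*-
# force_encoding relabels bytes; it never rewrites them.

raw  = "\xCE\xB1".b                      # 2 bytes, tagged as binary
text = raw.dup.force_encoding("UTF-8")   # same bytes, now read as UTF-8

p raw.encoding == Encoding::BINARY  # true
p text == "α"                       # true
p text.bytes == raw.bytes           # true: the bytes did not change

# A wrong label is accepted silently. Check with valid_encoding?.
bad = "\xFF".b.force_encoding("UTF-8")
p bad.valid_encoding?               # false: 0xFF never occurs in UTF-8
```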

Convert Encoding for a String

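Use the method encode! to convert a string to another encoding in place. This rewrites the bytes so they represent the same characters in the new encoding.

Use the method encode to get a converted copy. The original string is not modified.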

# -*- coding: utf-8 -*-
# ruby

ss = "α"
p ss.encoding.name # "UTF-8"

ss.encode!("GB18030")
p ss.encoding.name # "GB18030"

p ss.encode("utf-8") # "α"
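Unlike force_encoding, a conversion changes the bytes, and it can fail when the target encoding has no such character. Here is a short sketch of both points. UTF-16BE and US-ASCII are chosen here only because their byte values are easy to check by hand; the variable names are for illustration.

```ruby
# -*- coding: utf-8 -*-
# encode returns a converted copy; the bytes differ, the character does not.

alpha = "α"
utf16 = alpha.encode("UTF-16BE")

p alpha.bytes          # [206, 177]
p utf16.bytes          # [3, 177]
p alpha.encoding.name  # "UTF-8", the original is untouched

# US-ASCII has no α, so a plain encode raises an error.
error_class = nil
begin
  alpha.encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  error_class = e.class
end
p error_class          # Encoding::UndefinedConversionError

# Options let you substitute a replacement instead of raising.
ascii = alpha.encode("US-ASCII", undef: :replace, replace: "?")
p ascii                # "?"
```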

Default Encoding of Strings

A string literal gets its encoding from the source code encoding.

A string read from a file gets its encoding from the file's external encoding. When you open a file, you can specify the external encoding like this:

# -*- coding: utf-8 -*-
# ruby

ff = File.open( "file.txt", "r:UTF-8") # open a file for reading, specifying its encoding

ff.each {|xx| p xx } # print each line
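The same idea can be written with the block form of File.open, which closes the file automatically when the block ends. The sketch below creates its own sample file in a temporary directory so it can run anywhere; the file name and variable names are made up for the example.

```ruby
# -*- coding: utf-8 -*-
require "tmpdir"

lines = []
Dir.mktmpdir do |dir|
  path = File.join(dir, "file.txt")    # sample file for this sketch
  File.binwrite(path, "abc♥\nxyz\n")   # store the UTF-8 bytes as they are

  # Block form: the file is closed when the block returns.
  File.open(path, "r:UTF-8") do |ff|
    ff.each_line { |line| lines << line }
  end
end

p lines.map { |line| line.encoding.name }  # ["UTF-8", "UTF-8"]
p lines.map(&:size)                        # [5, 4], counted in characters
```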

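For file input and output, Ruby has the concepts of:

external encoding: the encoding of the data in the file.
internal encoding: the encoding of the strings Ruby hands to your program. If it is set, Ruby converts from the external encoding to the internal encoding when reading.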

Example:

# -*- coding: utf-8 -*-
# ruby

# Read a file.
# The first encoding (external) tells Ruby the encoding of the file.
# The second encoding (internal) tells Ruby the encoding to use when the content becomes a Ruby string.

ff = File.open( "file.txt", "r:UTF-8:UTF-16")

p ff.external_encoding.name # UTF-8
p ff.internal_encoding.name # UTF-16

# read a line, save to ss
ss = ff.readline

# print ss's encoding
p ss.encoding.name # UTF-16
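A frequent real-world case is the reverse direction: a file in a legacy encoding that you want as UTF-8 strings. The following is a sketch under that assumption, using ISO-8859-1 as the external encoding because its bytes are easy to verify. The file is generated by the example itself, and all names are hypothetical.

```ruby
# -*- coding: utf-8 -*-
require "tmpdir"

ext_name = int_name = line = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "latin1.txt")
  File.binwrite(path, "caf\xE9\n".b)   # "café" in ISO-8859-1: é is byte 0xE9

  # external = ISO-8859-1 (on disk), internal = UTF-8 (in the program)
  File.open(path, "r:ISO-8859-1:UTF-8") do |ff|
    ext_name = ff.external_encoding.name
    int_name = ff.internal_encoding.name
    line = ff.readline
  end
end

p ext_name            # "ISO-8859-1"
p int_name            # "UTF-8"
p line.encoding.name  # "UTF-8"
p line.bytes          # [99, 97, 102, 195, 169, 10], é became 2 bytes
```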

When writing to a file, you can also specify an encoding, like this: open("output.txt", "w:GB18030"). Ruby converts each string to that encoding as it is written.
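Here is a small sketch of writing with an external encoding and then checking the bytes that ended up on disk. ISO-8859-1 is used in place of GB18030 only because the expected bytes are easy to verify by hand; the path and variable names are made up.

```ruby
# -*- coding: utf-8 -*-
require "tmpdir"

written = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "output.txt")

  # The UTF-8 string is converted to ISO-8859-1 on the way out.
  File.open(path, "w:ISO-8859-1") { |out| out.write("café") }

  written = File.binread(path).bytes   # read back the raw bytes
end

p written  # [99, 97, 102, 233], é is the single byte 0xE9
```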

Supported Encodings

You can print all supported encodings like this:

# -*- coding: utf-8 -*-
# ruby

Encoding.list.each { |xx| p xx.name}
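If you only need to know whether one particular encoding is available, looking it up by name is shorter than scanning the whole list. A minimal sketch follows; encoding_supported? is a helper invented for this example and is not part of Ruby.

```ruby
# -*- coding: utf-8 -*-
# Look up an encoding by name. Encoding.find ignores letter case
# and raises ArgumentError for a name Ruby does not know.

# Hypothetical helper, defined only for this sketch.
def encoding_supported?(name)
  Encoding.find(name)
  true
rescue ArgumentError
  false
end

p Encoding.find("utf-8") == Encoding::UTF_8   # true
p encoding_supported?("GB18030")              # true
p encoding_supported?("no-such-encoding")     # false

# name_list gives the names (including aliases) as plain strings.
p Encoding.name_list.include?("UTF-8")        # true
```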