Ruby Unicode Tutorial 💎


Ruby has had robust Unicode support since version 1.9. This page covers Ruby 1.9 and later.

If you are not familiar with Unicode, first see: Unicode Basics: What's Character Set, Character Encoding, UTF-8?

Source Code Encoding and Default Encoding for String

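Start your file with # -*- coding: utf-8 -*- on the first line (or on the second line, if the first is a #! line). This sets the source code's encoding to UTF-8. This line is called a magic comment. (Since Ruby 2.0, UTF-8 is already the default source encoding, so the magic comment is needed only for Ruby 1.9 or for a different encoding.)

Any of the following forms also works.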

# -*- coding: utf-8 -*-  # the same convention is used by Emacs and Python
# -*- coding: UTF-8 -*-

# coding: utf-8
# coding: UTF-8

# encoding: utf-8
# encoding: UTF-8

p __ENCODING__                  # ⇒ #<Encoding:UTF-8>

Ruby String = Bytes + Encoding Info

In Ruby, each string is an object that carries its own encoding info. Use the method “encoding” to find a string's encoding. (This is different from most other languages, where all strings are converted to a single internal representation, for example a Unicode-based representation in Python 3 and Emacs Lisp, or UTF-16 in Java.)

# -*- coding: utf-8 -*-
# ruby

p "abc♥".encoding                # ⇒ #<Encoding:UTF-8>

p "abc♥".encoding.name           # ⇒ "UTF-8"

p "abc♥".size                    # ⇒ 4

p "abc♥".bytesize                # ⇒ 6
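To make the “bytes + encoding info” idea concrete, here is a minimal sketch that looks at the same string as characters, code points, and raw bytes. The helper name describe_string is made up for this illustration; it is not part of Ruby.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper (not part of Ruby): collects several views of one string.
def describe_string(str)
  {
    encoding:   str.encoding.name,  # the encoding label attached to the string
    chars:      str.chars,          # characters, as interpreted under that label
    codepoints: str.codepoints,     # integer code point of each character
    bytes:      str.bytes           # the raw bytes actually stored
  }
end

info = describe_string("abc♥")
p info[:encoding]     # ⇒ "UTF-8"
p info[:chars]        # ⇒ ["a", "b", "c", "♥"]
p info[:codepoints]   # ⇒ [97, 98, 99, 9829]
p info[:bytes]        # ⇒ [97, 98, 99, 226, 153, 165]
```

The heart is one character (one code point) but three bytes in UTF-8, which is why “size” and “bytesize” differ.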

Change a String's Encoding Info

Use the method “force_encoding” to change a string's encoding info. This doesn't convert the string's bytes; it only changes the encoding metadata attached to them.

# -*- coding: utf-8 -*-
# ruby

ss = "α"
p ss                             # ⇒ "α"

ss.force_encoding("GB18030")     # relabel only; the bytes CE B1 stay the same
p ss                             # ⇒ "\x{CEB1}"
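As a rough check that “force_encoding” only relabels, the sketch below compares the bytes before and after. The helper relabel is a hypothetical name used here; it works on a copy so the original string is left alone.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper: relabel a copy of the string, leaving the original as is.
def relabel(str, encoding_name)
  str.dup.force_encoding(encoding_name)
end

original  = "α"
relabeled = relabel(original, "GB18030")

p original.bytes             # ⇒ [206, 177]
p relabeled.bytes            # ⇒ [206, 177]  (same bytes, new label)
p relabeled.encoding.name    # ⇒ "GB18030"
p relabeled == original      # ⇒ false (same bytes, but different encodings)

# A label that does not fit the bytes gives an invalid string, not a converted one.
p relabel("\xFF", "UTF-8").valid_encoding?   # ⇒ false
```

The method “valid_encoding?” is a quick way to find out whether a label actually matches the bytes.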

Convert Encoding for a String

Use the method “encode!” to convert a string's encoding in place. Use the method “encode” to get a converted copy; the original string is not modified. Unlike “force_encoding”, both actually change the bytes.

# -*- coding: utf-8 -*-
# ruby

ss = "α"
p ss.encoding.name               # ⇒ "UTF-8"

ss.encode!("GB18030")            # convert in place
p ss.encoding.name               # ⇒ "GB18030"

p ss.encode("utf-8")             # ⇒ "α"
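The following sketch shows that “encode” really changes the bytes, and one possible way to deal with characters the target encoding cannot represent. The helper transcode_lossy is a hypothetical name; the replacement character "?" is just a choice made for this example.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper: convert to `target`, substituting "?" for any
# character that is invalid or has no equivalent in the target encoding.
def transcode_lossy(str, target)
  str.encode(target, invalid: :replace, undef: :replace, replace: "?")
end

ss = "α"
gb = ss.encode("GB18030")      # returns a new string; ss is not modified
p ss.bytes                     # ⇒ [206, 177]   (UTF-8 bytes)
p gb.bytes                     # ⇒ [166, 193]   (GB18030 bytes)
p gb.encode("UTF-8") == ss     # ⇒ true

p transcode_lossy("abc♥", "US-ASCII")   # ⇒ "abc?"
```

Without the replace options, converting "abc♥" to US-ASCII raises Encoding::UndefinedConversionError.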

Default Encoding of Strings

The default encoding of a string literal is your source code's encoding.
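A small sketch of this rule: a string literal takes the source encoding reported by __ENCODING__, while a string that does not come from a literal may not. The helper name literal_matches_source? is made up for this example.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper: does a string literal carry the source encoding?
def literal_matches_source?
  "abc".encoding == __ENCODING__
end

p literal_matches_source?                   # ⇒ true

# A string not created from a literal does not follow the source encoding:
p String.new.encoding == Encoding::BINARY   # ⇒ true
```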

When you open a file, you can specify its external encoding like this:

# -*- coding: utf-8 -*-
# ruby

ff = File.open("unicode_ruby.html", "r:UTF-8") # open a file for reading, specifying its encoding

ff.each {|xx| p xx }            # print each line

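Ruby has the concept of an external encoding (the encoding of the data in the file) and an internal encoding (the encoding the data is converted to when it becomes a Ruby string). You can specify both when opening a file: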


# -*- coding: utf-8 -*-
# ruby

# read a file.
# The first encoding tells Ruby what encoding the file is in (external).
# The second tells Ruby the encoding to use when the content becomes a Ruby string (internal).

ff = File.open("unicode_ruby.html", "r:UTF-8:UTF-16")

p ff.external_encoding    # UTF-8
p ff.internal_encoding    # UTF-16

# read a line, save to ss
ss = ff.readline

# print ss's encoding
p ss.encoding             # ⇒ UTF-16

When writing to a file, you can also specify an encoding, like this: File.open("output.txt", "w:GB18030").
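Putting writing and reading together, here is a minimal sketch that writes a string as GB18030 and reads it back with GB18030 as the external encoding and UTF-8 as the internal encoding. The helper gb18030_round_trip and the temporary directory are assumptions made for this example, so that no real file is touched.

```ruby
# -*- coding: utf-8 -*-
require "tmpdir"

# Hypothetical helper: write `text` as GB18030, then read it back as UTF-8.
def gb18030_round_trip(text)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "output.txt")

    # "w:GB18030" converts the string to GB18030 while writing.
    File.open(path, "w:GB18030") { |f| f.write(text) }

    # "r:GB18030:UTF-8" means: external encoding GB18030, internal encoding UTF-8.
    File.open(path, "r:GB18030:UTF-8") do |f|
      { external: f.external_encoding,
        internal: f.internal_encoding,
        raw_size: File.size(path),   # size of the file on disk, in bytes
        text:     f.read }
    end
  end
end

result = gb18030_round_trip("α")
p result[:external]        # ⇒ #<Encoding:GB18030>
p result[:internal]        # ⇒ #<Encoding:UTF-8>
p result[:raw_size]        # ⇒ 2   (α takes 2 bytes in GB18030)
p result[:text]            # ⇒ "α"
p result[:text].encoding   # ⇒ #<Encoding:UTF-8>
```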

The Encoding Class; Supported Encodings

You can print all supported encodings like this:

# -*- coding: utf-8 -*-
# ruby

Encoding.list.each { |xx| p xx }
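If you only need to know whether one particular encoding is available, looking it up by name is shorter than scanning the whole list. A small sketch, where the helper encoding_supported? is a made-up name:

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper: true if this Ruby build knows the encoding name.
def encoding_supported?(name)
  Encoding.find(name)   # raises ArgumentError for an unknown name
  true
rescue ArgumentError
  false
end

p encoding_supported?("GB18030")            # ⇒ true
p encoding_supported?("no-such-encoding")   # ⇒ false
p Encoding.find("utf-8")                    # ⇒ #<Encoding:UTF-8>  (lookup ignores case)
p Encoding.name_list.include?("UTF-8")      # ⇒ true
```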
