Ruby Unicode Tutorial 💎

If you are not familiar with Unicode, first see: UNICODE Basics: What's Character Set, Character Encoding, UTF-8, and All That?

Ruby, starting with version 1.9, has robust support for Unicode. This page is about Ruby 1.9.

Source Code Encoding and Default Encoding for String

Start your file with the line # -*- coding: utf-8 -*-, placed on the first line (or on the second line, if the first line is a #! line). This sets the source code's encoding to UTF-8. Such a line is called a magic comment.

Any of the following forms also works.

# -*- coding: UTF-8 -*-
# -*- coding: utf-8 -*-           # ← emacs convention. Python too.

# coding: utf-8
# coding: UTF-8

# encoding: utf-8
# encoding: UTF-8

p __ENCODING__                  # ⇒ #<Encoding:UTF-8>
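
To check that the magic comment took effect, compare __ENCODING__ with the encoding of a plain string literal. This is a minimal sketch; the variable names are made up for illustration. One addition beyond Ruby 1.9: from Ruby 2.0 on, UTF-8 is already the default source encoding, so there the comment is optional.

```ruby
# -*- coding: utf-8 -*-
# ruby

# __ENCODING__ is the encoding Ruby used to parse this source file.
source_encoding = __ENCODING__

# A plain string literal picks up the source encoding.
literal_encoding = "abc".encoding

p source_encoding                       # ⇒ #<Encoding:UTF-8>
p literal_encoding == source_encoding   # ⇒ true
```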

Ruby String = Bytes + Encoding Info

In Ruby, each string is an object that carries information about its encoding. Use the method “encoding” to find a string's encoding. (This differs from most other languages, where all strings are converted to a single internal representation: for example, Unicode code points in Python 3, a UTF-8 based representation in Emacs Lisp, UTF-16 in Java.)

# -*- coding: utf-8 -*-
# ruby

p "abc♥".encoding                # ⇒ #<Encoding:UTF-8>

p "abc♥".encoding.name           # ⇒ "UTF-8"

p "abc♥".size                    # ⇒ 4

p "abc♥".bytesize                # ⇒ 6
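
The difference between “size” and “bytesize” is easier to see when the characters and the bytes are listed side by side. A small sketch, using an escape for the heart character so it does not depend on the source encoding:

```ruby
# -*- coding: utf-8 -*-
# ruby

ss = "abc\u2665"                 # "abc" followed by U+2665 (black heart suit)

chars = ss.chars.to_a            # characters, decoded according to ss.encoding
bytes = ss.bytes.to_a            # the raw bytes stored in the string

p chars.size                     # ⇒ 4
p bytes                          # ⇒ [97, 98, 99, 226, 153, 165]
p chars.size == ss.size          # ⇒ true
p bytes.size == ss.bytesize      # ⇒ true
```

The heart takes 3 bytes in UTF-8, which is why the byte count is 6 while the character count is 4.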

Change a String's Encoding Info

Use the method “force_encoding” to change a string's encoding info. This does not convert the string's bytes; it only changes the encoding metadata attached to them.

# -*- coding: utf-8 -*-
# ruby

ss = "α"
p ss                             # ⇒ "α"

ss.force_encoding("GB18030")
p ss                            # ⇒ "\x{CEB1}"
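
A typical use of “force_encoding” is to label raw bytes whose encoding you already know. The sketch below builds the two bytes of “α” by hand (an illustrative setup, not from the text above), relabels them, and then uses “valid_encoding?” to detect a wrong label.

```ruby
# -*- coding: utf-8 -*-
# ruby

# Two raw bytes with no text encoding attached (Ruby tags them as binary).
raw = [0xCE, 0xB1].pack("C*")
p raw.encoding == Encoding::BINARY      # ⇒ true

# Relabel a copy as UTF-8. The bytes stay the same; only the tag changes.
ss = raw.dup.force_encoding("UTF-8")
p ss.bytes.to_a == raw.bytes.to_a       # ⇒ true
p ss == "\u03B1"                        # ⇒ true (Greek small letter alpha)

# A wrong label can give an invalid string; valid_encoding? detects that.
bad = [0xFF].pack("C*").force_encoding("UTF-8")
p bad.valid_encoding?                   # ⇒ false
```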

Convert Encoding for a String

Use the method “encode!” to convert a string to another encoding in place. Use the method “encode” to get a converted copy, leaving the original string unmodified.

# -*- coding: utf-8 -*-
# ruby

ss = "α"
p ss.encoding.name                             # ⇒ "UTF-8"

ss.encode!("GB18030")
p ss.encoding.name                             # ⇒ "GB18030"

p ss.encode("utf-8")                           # ⇒ "α"
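
Unlike “force_encoding”, “encode” really changes the bytes. The sketch below shows that, and also what happens when the target encoding has no such character. UTF-16BE and ISO-8859-1 are chosen here only for illustration.

```ruby
# -*- coding: utf-8 -*-
# ruby

alpha = "\u03B1"                               # "α" as a UTF-8 string

# encode returns a new string with different bytes; alpha is untouched.
utf16 = alpha.encode("UTF-16BE")
p alpha.bytes.to_a                             # ⇒ [206, 177]
p utf16.bytes.to_a                             # ⇒ [3, 177]
p utf16.encode("UTF-8") == alpha               # ⇒ true

# ISO-8859-1 has no α, so a plain encode raises an error.
error_class = nil
begin
  alpha.encode("ISO-8859-1")
rescue Encoding::UndefinedConversionError => e
  error_class = e.class
end
p error_class                                  # ⇒ Encoding::UndefinedConversionError

# With :undef => :replace, the unconvertible character becomes "?".
latin1 = alpha.encode("ISO-8859-1", :undef => :replace)
p latin1                                       # ⇒ "?"
```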

Default Encoding of Strings

A string literal gets its default encoding from the source code encoding.

When you open a file, you can specify its external encoding like this:

# -*- coding: utf-8 -*-
# ruby

ff = File.open( "unicode_ruby.html", "r:UTF-8") # open a file for reading, and specify its encoding

ff.each {|xx| p xx }            # print each line
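
Here is a self-contained variant that can be run without an existing file. It is a sketch under some assumptions: the file name “sample.txt” and the temporary directory are made up for the example. It uses the block form of File.open, which closes the file automatically, and shows that File.read accepts the same information through its :encoding option.

```ruby
# -*- coding: utf-8 -*-
# ruby
require "tmpdir"

line_encodings = []
text = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "sample.txt")      # hypothetical file name
  File.binwrite(path, "abc\u2665\n")       # store the UTF-8 bytes as they are

  # Block form: the file is closed when the block ends.
  File.open(path, "r:UTF-8") do |ff|
    ff.each_line { |line| line_encodings << line.encoding.name }
  end

  # File.read takes the encoding through an option instead of the mode string.
  text = File.read(path, :encoding => "UTF-8")
end

p line_encodings                           # ⇒ ["UTF-8"]
p text.size                                # ⇒ 5
```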

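Ruby has the concepts of external encoding (the encoding of the content in the file) and internal encoding (the encoding used when the content becomes a Ruby string).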

Example:

# -*- coding: utf-8 -*-
# ruby

# read a file.
# Tell Ruby what encoding is that file
# Tell Ruby the encoding to use when the content becomes Ruby string

ff = File.open( "unicode_ruby.html", "r:UTF-8:UTF-16")

p ff.external_encoding.name     # UTF-8
p ff.internal_encoding.name     # UTF-16

# read a line, save to ss
ss = ff.readline

# print ss's encoding
p ss.encoding.name              # ⇒ UTF-16
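
The conversion from external to internal encoding can be checked by looking at the bytes. The sketch below swaps in a different pair of encodings (ISO-8859-1 on disk, UTF-8 in memory), and the file name is hypothetical; it writes four known bytes and reads them back.

```ruby
# -*- coding: utf-8 -*-
# ruby
require "tmpdir"

line = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "latin1.txt")      # hypothetical file name

  # "café" in ISO-8859-1 is 4 bytes; 0xE9 is the e with acute accent.
  File.binwrite(path, [0x63, 0x61, 0x66, 0xE9].pack("C*"))

  File.open(path, "r:ISO-8859-1:UTF-8") do |ff|
    p ff.external_encoding.name            # ⇒ "ISO-8859-1"
    p ff.internal_encoding.name            # ⇒ "UTF-8"
    line = ff.readline                     # converted while reading
  end
end

p line.encoding.name                       # ⇒ "UTF-8"
p line == "caf\u00E9"                      # ⇒ true
p line.bytesize                            # ⇒ 5
```

The file holds 4 bytes, but the string holds 5, because the accented letter takes 2 bytes in UTF-8.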

When writing to a file, you can also specify an encoding, like this: open("output.txt", "w:GB18030").
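
A sketch of writing with an encoding. It uses ISO-8859-1 instead of GB18030 so that the resulting bytes are easy to verify, and a temporary directory so nothing is left behind. The string is converted from UTF-8 to the file's encoding as it is written.

```ruby
# -*- coding: utf-8 -*-
# ruby
require "tmpdir"

written = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "output.txt")

  # The UTF-8 string is converted to ISO-8859-1 on its way to the file.
  File.open(path, "w:ISO-8859-1") { |ff| ff.write("caf\u00E9") }

  # Read the file back as raw bytes to see what was stored.
  written = File.binread(path).bytes.to_a
end

p written                                  # ⇒ [99, 97, 102, 233]
```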

The Encoding Class; Supported Encodings

You can print all supported encodings like this:

# -*- coding: utf-8 -*-
# ruby

Encoding.list.each { |xx| p xx.name }

Output (the exact list depends on the Ruby version):
"ASCII-8BIT"
"UTF-8"
"US-ASCII"
"Big5"
"Big5-HKSCS"
"Big5-UAO"
"CP949"
"Emacs-Mule"
"EUC-JP"
"EUC-KR"
"EUC-TW"
"GB18030"
"GBK"
"ISO-8859-1"
"ISO-8859-2"
"ISO-8859-3"
"ISO-8859-4"
"ISO-8859-5"
"ISO-8859-6"
"ISO-8859-7"
"ISO-8859-8"
"ISO-8859-9"
"ISO-8859-10"
"ISO-8859-11"
"ISO-8859-13"
"ISO-8859-14"
"ISO-8859-15"
"ISO-8859-16"
"KOI8-R"
"KOI8-U"
"Shift_JIS"
"UTF-16BE"
"UTF-16LE"
"UTF-32BE"
"UTF-32LE"
"Windows-1251"
"IBM437"
"IBM737"
"IBM775"
"CP850"
"IBM852"
"CP852"
"IBM855"
"CP855"
"IBM857"
"IBM860"
"IBM861"
"IBM862"
"IBM863"
"IBM864"
"IBM865"
"IBM866"
"IBM869"
"Windows-1258"
"GB1988"
"macCentEuro"
"macCroatian"
"macCyrillic"
"macGreek"
"macIceland"
"macRoman"
"macRomania"
"macThai"
"macTurkish"
"macUkraine"
"CP950"
"CP951"
"stateless-ISO-2022-JP"
"eucJP-ms"
"CP51932"
"GB2312"
"GB12345"
"ISO-2022-JP"
"ISO-2022-JP-2"
"CP50220"
"CP50221"
"Windows-1252"
"Windows-1250"
"Windows-1256"
"Windows-1253"
"Windows-1255"
"Windows-1254"
"TIS-620"
"Windows-874"
"Windows-1257"
"Windows-31J"
"MacJapanese"
"UTF-7"
"UTF8-MAC"
"UTF-16"
"UTF-32"
"UTF8-DoCoMo"
"SJIS-DoCoMo"
"UTF8-KDDI"
"SJIS-KDDI"
"ISO-2022-JP-KDDI"
"stateless-ISO-2022-JP-KDDI"
"UTF8-SoftBank"
"SJIS-SoftBank"

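Besides listing encodings, the Encoding class can look one up by name. A short sketch using a few standard class methods; the value of Encoding.default_external depends on your environment, so no output is shown for it.

```ruby
# -*- coding: utf-8 -*-
# ruby

# Lookup by name is case-insensitive and returns an Encoding object.
utf8 = Encoding.find("utf-8")
p utf8 == Encoding::UTF_8                          # ⇒ true

# Aliases resolve to the same object.
binary = Encoding.find("BINARY")
p binary == Encoding::BINARY                       # ⇒ true

# name_list holds names and aliases, so it is at least as long as list.
names = Encoding.name_list
p names.include?("UTF-8")                          # ⇒ true
p names.size >= Encoding.list.size                 # ⇒ true

# Encoding assumed for files and standard input when none is given.
p Encoding.default_external
```
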