Ruby Unicode Tutorial 💎

If you are not familiar with Unicode, first see: Unicode Basics: What's Character Set, Character Encoding, UTF-8?

Ruby, starting with version 1.9, has robust support for Unicode. This page is about Ruby 1.9.

Source Code Encoding and Default Encoding for String

Start your file with the comment # -*- coding: utf-8 -*- on the first line (or on the second line, if the first line is a #! line). This sets UTF-8 as the source code's encoding. Such a comment is called a magic comment.

Any of the following forms also works.

# -*- coding: UTF-8 -*-
# -*- coding: utf-8 -*-           # ← emacs convention. Python too.

# coding: utf-8
# coding: UTF-8

# encoding: utf-8
# encoding: UTF-8

p __ENCODING__                  # ⇒ #<Encoding:UTF-8>

Ruby String = Bytes + Encoding Info

In Ruby, each string is an object that carries its own encoding info. Use the method “encoding” to find a string's encoding. (This differs from most other languages, where all strings are converted to one internal representation. For example, Unicode code points in Python 3, a UTF-8 based encoding in Emacs Lisp, UTF-16 in Java.)

# -*- coding: utf-8 -*-
# ruby

p "abc♥".encoding                # ⇒ #<Encoding:UTF-8>

p "abc♥".encoding.name           # ⇒ "UTF-8"

p "abc♥".size                    # ⇒ 4

p "abc♥".bytesize                # ⇒ 6
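
The difference between “size” and “bytesize” comes from the bytes behind each character. The following is a small sketch that lists the characters, the raw bytes, and how many bytes each character takes. The variable names are made up for this example.

```ruby
# -*- coding: utf-8 -*-
# ruby

ss = "abc♥"

p ss.chars.to_a                  # ⇒ ["a", "b", "c", "♥"]
p ss.bytes.to_a                  # ⇒ [97, 98, 99, 226, 153, 165]

# number of bytes used by each character
widths = ss.each_char.map { |cc| cc.bytesize }
p widths                         # ⇒ [1, 1, 1, 3]

# true when the bytes are well-formed for the string's encoding
p ss.valid_encoding?             # ⇒ true
```

The heart takes 3 bytes in UTF-8, which is why the string has 4 characters but 6 bytes.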

Change a String's Encoding Info

Use the method “force_encoding” to change a string's encoding info. This doesn't convert the string's bytes; it only changes the encoding metadata.

# -*- coding: utf-8 -*-
# ruby

ss = "α"
p ss                             # ⇒ "α"

ss.force_encoding("GB18030")
p ss                            # ⇒ "\x{CEB1}"
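
To see that only the label changes, one can compare the bytes before and after. The sketch below also shows a typical use, labeling raw bytes as UTF-8, and that a wrong label can leave the string invalid. The variable names and sample bytes are made up for this example.

```ruby
# -*- coding: utf-8 -*-
# ruby

ss = "α".dup                     # dup gives a copy that is safe to modify
before = ss.bytes.to_a           # [206, 177]

ss.force_encoding("ASCII-8BIT")  # relabel as binary
p ss.bytes.to_a == before        # ⇒ true
p ss.size                        # ⇒ 2

ss.force_encoding("UTF-8")       # relabel back
p ss.size                        # ⇒ 1
p ss.valid_encoding?             # ⇒ true

# half of a 2-byte character, labeled as UTF-8:
# no error is raised, but the string is not valid UTF-8
bad = [0xCE].pack("C").force_encoding("UTF-8")
p bad.valid_encoding?            # ⇒ false
```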

Convert Encoding for a String

Use the method “encode!” to convert a string's encoding in place. Use the method “encode” to get a converted copy, leaving the original string unmodified.

# -*- coding: utf-8 -*-
# ruby

ss = "α"
p ss.encoding.name                             # ⇒ "UTF-8"

ss.encode!("GB18030")
p ss.encoding.name                             # ⇒ "GB18030"

p ss.encode("utf-8")                           # ⇒ "α"
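
Unlike “force_encoding”, conversion rewrites the bytes. It can also fail when the target encoding has no such character. The following sketch, with made-up variable names, shows both cases, plus the :undef and :replace options of “encode” for substituting a placeholder.

```ruby
# -*- coding: utf-8 -*-
# ruby

ss = "α"
gg = ss.encode("GB18030")        # a converted copy; ss is untouched

p gg.bytes.to_a == ss.bytes.to_a # ⇒ false
p gg.size                        # ⇒ 1
p gg.encode("UTF-8") == ss       # ⇒ true

# ISO-8859-1 has no heart character, so the conversion raises an error
begin
  "abc♥".encode("ISO-8859-1")
rescue Encoding::UndefinedConversionError => ee
  p ee.class                     # ⇒ Encoding::UndefinedConversionError
end

# replace characters that cannot be converted
xx = "abc♥".encode("ISO-8859-1", :undef => :replace, :replace => "?")
p xx                             # ⇒ "abc?"
```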

Default Encoding of Strings

The default encoding of a string literal is your source code encoding.
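
A quick way to check this is to compare a literal's encoding with __ENCODING__. The sketch below also shows, as a contrast, that a string not created from a literal need not follow the source encoding. The variable names are made up for this example.

```ruby
# -*- coding: utf-8 -*-
# ruby

lit = "abc"
p lit.encoding == __ENCODING__             # ⇒ true

# String.new without arguments gives a binary (ASCII-8BIT) string
empty = String.new
p empty.encoding == Encoding::ASCII_8BIT   # ⇒ true
```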

When you open a file, you can specify its external encoding like this:

# -*- coding: utf-8 -*-
# ruby

ff = File.open( "unicode_ruby.html", "r:UTF-8") # open a file for reading, specifying its encoding

ff.each {|xx| p xx }            # print each line
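
The following is a self-contained variation that does not need an existing file. It assumes a scratch file made with Tempfile (the file name and content are made up), and compares reading the same bytes as UTF-8 and as binary.

```ruby
# -*- coding: utf-8 -*-
# ruby
require "tempfile"

# make a small UTF-8 file to read back
tt = Tempfile.new("unicode_ruby")
tt.close
File.open(tt.path, "w:UTF-8") { |ff| ff.write("abc♥\n") }

# read with external encoding UTF-8; the block form closes the file
first = File.open(tt.path, "r:UTF-8") { |ff| ff.readline }
p first.encoding.name            # ⇒ "UTF-8"
p first.size                     # ⇒ 5

# same file read as binary: every byte counts as one character
raw = File.open(tt.path, "r:ASCII-8BIT") { |ff| ff.readline }
p raw.size                       # ⇒ 7
```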

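Ruby has the concepts of:

external encoding: the encoding of the data in the file.
internal encoding: the encoding the data is converted to when it becomes a Ruby string.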

Example:

# -*- coding: utf-8 -*-
# ruby

# read a file.
# Tell Ruby what encoding the file uses (external encoding)
# Tell Ruby what encoding to use when the content becomes a Ruby string (internal encoding)

ff = File.open( "unicode_ruby.html", "r:UTF-8:UTF-16")

p ff.external_encoding.name     # UTF-8
p ff.internal_encoding.name     # UTF-16

# read a line, save to ss
ss = ff.readline

# print ss's encoding
p ss.encoding.name              # ⇒ UTF-16

When writing to a file, you can also specify an encoding, like this: open("output.txt", "w:GB18030").
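
Writing and reading can be combined into a round trip. The sketch below assumes a scratch file made with Tempfile (the file name and content are made up). It writes a UTF-8 string out as GB18030, then reads it back with GB18030 as the external encoding and UTF-8 as the internal encoding.

```ruby
# -*- coding: utf-8 -*-
# ruby
require "tempfile"

tt = Tempfile.new("gb18030_example")   # scratch file, the name is arbitrary
tt.close

# write: the UTF-8 string is converted to GB18030 on the way out
File.open(tt.path, "w:GB18030") { |ff| ff.write("α\n") }

# the bytes on disk are no longer the UTF-8 bytes
raw = File.binread(tt.path)
p raw.bytes.to_a == "α\n".bytes.to_a   # ⇒ false

# read: external encoding GB18030, internal encoding UTF-8
ss = nil
File.open(tt.path, "r:GB18030:UTF-8") do |ff|
  p ff.external_encoding.name          # ⇒ "GB18030"
  p ff.internal_encoding.name          # ⇒ "UTF-8"
  ss = ff.readline
end

p ss.encoding.name                     # ⇒ "UTF-8"
p ss == "α\n"                          # ⇒ true
```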

The Encoding Class; Supported Encodings

You can print all supported encodings like this:

# -*- coding: utf-8 -*-
# ruby

Encoding.list.each { |xx| p xx.name}
"ASCII-8BIT"
"UTF-8"
"US-ASCII"
"Big5"
"Big5-HKSCS"
"Big5-UAO"
"CP949"
"Emacs-Mule"
"EUC-JP"
"EUC-KR"
"EUC-TW"
"GB18030"
"GBK"
"ISO-8859-1"
"ISO-8859-2"
"ISO-8859-3"
"ISO-8859-4"
"ISO-8859-5"
"ISO-8859-6"
"ISO-8859-7"
"ISO-8859-8"
"ISO-8859-9"
"ISO-8859-10"
"ISO-8859-11"
"ISO-8859-13"
"ISO-8859-14"
"ISO-8859-15"
"ISO-8859-16"
"KOI8-R"
"KOI8-U"
"Shift_JIS"
"UTF-16BE"
"UTF-16LE"
"UTF-32BE"
"UTF-32LE"
"Windows-1251"
"IBM437"
"IBM737"
"IBM775"
"CP850"
"IBM852"
"CP852"
"IBM855"
"CP855"
"IBM857"
"IBM860"
"IBM861"
"IBM862"
"IBM863"
"IBM864"
"IBM865"
"IBM866"
"IBM869"
"Windows-1258"
"GB1988"
"macCentEuro"
"macCroatian"
"macCyrillic"
"macGreek"
"macIceland"
"macRoman"
"macRomania"
"macThai"
"macTurkish"
"macUkraine"
"CP950"
"CP951"
"stateless-ISO-2022-JP"
"eucJP-ms"
"CP51932"
"GB2312"
"GB12345"
"ISO-2022-JP"
"ISO-2022-JP-2"
"CP50220"
"CP50221"
"Windows-1252"
"Windows-1250"
"Windows-1256"
"Windows-1253"
"Windows-1255"
"Windows-1254"
"TIS-620"
"Windows-874"
"Windows-1257"
"Windows-31J"
"MacJapanese"
"UTF-7"
"UTF8-MAC"
"UTF-16"
"UTF-32"
"UTF8-DoCoMo"
"SJIS-DoCoMo"
"UTF8-KDDI"
"SJIS-KDDI"
"ISO-2022-JP-KDDI"
"stateless-ISO-2022-JP-KDDI"
"UTF8-SoftBank"
"SJIS-SoftBank"
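
Instead of scanning the list by eye, you can ask the Encoding class directly. The following is a minimal sketch; the helper encoding_supported? is a made-up name for this example, built on Encoding.find, which raises ArgumentError for an unknown name.

```ruby
# -*- coding: utf-8 -*-
# ruby

# look up an encoding by name (case does not matter)
p Encoding.find("utf-8")                        # ⇒ #<Encoding:UTF-8>
p Encoding.find("utf-8") == Encoding::UTF_8     # ⇒ true

# name_list includes alias names too
p Encoding.name_list.include?("GB18030")        # ⇒ true

# hypothetical helper: true if Ruby knows an encoding by this name
def encoding_supported?(name)
  Encoding.find(name)
  true
rescue ArgumentError
  false
end

p encoding_supported?("Shift_JIS")              # ⇒ true
p encoding_supported?("no-such-encoding")       # ⇒ false
```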
