If you are not familiar with Unicode, first see: UNICODE Basics: What's Character Set, Character Encoding, UTF-8, and All That?
Ruby, starting with version 1.9, has robust support for Unicode. This page is about Ruby 1.9.
Start your file with # -*- coding: utf-8 -*- on the first line (or on the second line, if the first is a #! line). This makes UTF-8 the source code's encoding. This line is called a magic comment.
Any of the following forms also works.
# -*- coding: utf-8 -*- # ← emacs convention. Python too.
# -*- coding: UTF-8 -*-
# coding: utf-8
# coding: UTF-8
# encoding: utf-8
# encoding: UTF-8

p __ENCODING__ # ⇒ #<Encoding:UTF-8>
In Ruby, each string is an object that carries information about its encoding. You can use the method “encoding” to find a string's encoding. (This is different from most other languages, where every string is converted into a single internal representation, for example Unicode strings in Python, a UTF-8-based encoding in Emacs Lisp, or UTF-16 in Java.)
# -*- coding: utf-8 -*-
# ruby

p "abc♥".encoding # ⇒ #<Encoding:UTF-8>
p "abc♥".encoding.name # ⇒ "UTF-8"
p "abc♥".size # ⇒ 4 (characters)
p "abc♥".bytesize # ⇒ 6 (bytes; ♥ takes 3 bytes in UTF-8)
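To see why size and bytesize differ, it can help to look at the same string as characters, code points, and bytes. The following is a minimal sketch; the variable names are only for illustration, and the .to_a calls are there because in Ruby 1.9 these methods return enumerators rather than arrays.

```ruby
# -*- coding: utf-8 -*-
# ruby

ss = "abc♥"

chars      = ss.chars.to_a       # one element per character
codepoints = ss.codepoints.to_a  # Unicode code point of each character
bytes      = ss.bytes.to_a       # raw bytes of the UTF-8 encoded string

p chars       # ⇒ ["a", "b", "c", "♥"]
p codepoints  # ⇒ [97, 98, 99, 9829]
p bytes       # ⇒ [97, 98, 99, 226, 153, 165]
```

The last three bytes (226, 153, 165) together encode the single character ♥.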
Use the method “force_encoding” to change a string's encoding info. This doesn't actually convert the bytes of the string; it simply changes the encoding metadata.
# -*- coding: utf-8 -*-
# ruby

ss = "α"
p ss # ⇒ "α"

ss.force_encoding("GB18030")
p ss # ⇒ "\x{CEB1}"
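One typical use of force_encoding is to label bytes whose real encoding you already know, for example data that arrived tagged as binary. The sketch below assumes the two bytes CE B1 really are UTF-8; raw and bytes_before are illustrative names, not anything from the text above.

```ruby
# -*- coding: utf-8 -*-
# ruby

# Two bytes tagged as binary (ASCII-8BIT); .dup gives a modifiable copy.
raw = "\xCE\xB1".dup.force_encoding("ASCII-8BIT")
bytes_before = raw.bytes.to_a

# Relabel the same bytes as UTF-8. No byte is changed.
raw.force_encoding("UTF-8")

p raw.bytes.to_a == bytes_before  # ⇒ true
p raw.valid_encoding?             # ⇒ true
p raw                             # ⇒ "α"
```

If the label is wrong for the bytes, valid_encoding? may return false, which is a cheap way to catch a bad guess.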
Use the method “encode!” to convert a string to another encoding in place. Use the method “encode” to get a converted copy without modifying the original string.
# -*- coding: utf-8 -*-
# ruby

ss = "α"
p ss.encoding.name # ⇒ "UTF-8"

ss.encode!("GB18030")
p ss.encoding.name # ⇒ "GB18030"

p ss.encode("utf-8") # ⇒ "α"
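Conversion can fail when a character has no counterpart in the target encoding; in that case encode raises Encoding::UndefinedConversionError. Here is a small sketch of both outcomes, using ISO-8859-1 as an example target because it has no α. The :undef and :replace options belong to String#encode; the variable names are only for illustration.

```ruby
# -*- coding: utf-8 -*-
# ruby

ss = "aα"

# ISO-8859-1 has no α, so a plain encode raises an error.
begin
  ss.encode("ISO-8859-1")
  failed = false
rescue Encoding::UndefinedConversionError
  failed = true
end
p failed  # ⇒ true

# Ask encode to substitute a replacement string instead of raising.
latin = ss.encode("ISO-8859-1", :undef => :replace, :replace => "?")
p latin                # ⇒ "a?"
p latin.encoding.name  # ⇒ "ISO-8859-1"
```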
By default, a string literal gets the encoding of the source file it appears in.
When you open a file, you can specify the external encoding like this:
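A quick way to check this is to compare a literal's encoding with __ENCODING__, which holds the encoding of the current source file. A minimal sketch, with source_enc and literal_enc as illustrative names:

```ruby
# -*- coding: utf-8 -*-
# ruby

# __ENCODING__ is the encoding of this source file.
source_enc = __ENCODING__

# A string literal takes its encoding from the source file.
literal_enc = "abc♥".encoding

p source_enc                 # ⇒ #<Encoding:UTF-8>
p literal_enc == source_enc  # ⇒ true
```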
# -*- coding: utf-8 -*-
# ruby

# open a file for reading, specifying its encoding
ff = File.open("unicode_ruby.html", "r:UTF-8")

# print each line
ff.each {|xx| p xx }

# close the file when done
ff.close
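File.open also accepts a block, in which case the file is closed automatically when the block ends. The sketch below writes a small sample file first so that it can run on its own; the file name sample_utf8.txt and its content are made up for illustration.

```ruby
# -*- coding: utf-8 -*-
# ruby
require "tmpdir"

lines = []

Dir.mktmpdir do |dir|
  path = File.join(dir, "sample_utf8.txt")  # hypothetical file name

  # Create a UTF-8 file to read back.
  File.open(path, "w:UTF-8") { |ff| ff.write("abc♥\nα\n") }

  # Open for reading as UTF-8; the block form closes the file for us.
  File.open(path, "r:UTF-8") do |ff|
    ff.each_line { |xx| lines << xx }
  end
end

p lines                   # ⇒ ["abc♥\n", "α\n"]
p lines[0].encoding.name  # ⇒ "UTF-8"
```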
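Ruby has the concepts of external encoding (the encoding of the data in the file) and internal encoding (the encoding the data is converted to when it becomes a Ruby string).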
Example:
# -*- coding: utf-8 -*-
# ruby

# Read a file.
# Tell Ruby what encoding the file is in (external encoding),
# and what encoding to use when the content becomes a Ruby string (internal encoding).
ff = File.open("unicode_ruby.html", "r:UTF-8:UTF-16")

p ff.external_encoding.name # ⇒ "UTF-8"
p ff.internal_encoding.name # ⇒ "UTF-16"

# read a line, save to ss
ss = ff.readline

# print ss's encoding
p ss.encoding.name # ⇒ "UTF-16"
When writing to a file, you can also specify an encoding, like this: open("output.txt", "w:GB18030").
You can print all supported encodings like this:
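Writing and reading with explicit encodings can be combined into a round trip. In this sketch a UTF-8 string is written as GB18030, then read back once as raw bytes and once with conversion to UTF-8. The file name output.txt comes from the text above; the temporary directory and the variable names are assumptions made for the example.

```ruby
# -*- coding: utf-8 -*-
# ruby
require "tmpdir"

text = "α"
raw_bytes = nil
read_back = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "output.txt")

  # "w:GB18030": the UTF-8 string is transcoded to GB18030 on write.
  File.open(path, "w:GB18030") { |ff| ff.write(text) }

  # "rb": read the file's bytes as they are on disk.
  raw_bytes = File.open(path, "rb") { |ff| ff.read.bytes.to_a }

  # "r:GB18030:UTF-8": external encoding GB18030, internal encoding UTF-8.
  read_back = File.open(path, "r:GB18030:UTF-8") { |ff| ff.read }
end

p raw_bytes == text.bytes.to_a  # ⇒ false (the file holds GB18030 bytes)
p read_back.encoding.name       # ⇒ "UTF-8"
p read_back == text             # ⇒ true
```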
# -*- coding: utf-8 -*-
# ruby

Encoding.list.each { |xx| p xx.name }
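To check for one particular encoding rather than scanning the whole list, Encoding.find looks an encoding up by name (case-insensitively) and raises ArgumentError if Ruby does not know it. A minimal sketch; the helper name encoding_supported? is made up for this example.

```ruby
# -*- coding: utf-8 -*-
# ruby

# Hypothetical helper: true if Ruby knows an encoding with this name.
def encoding_supported?(name)
  Encoding.find(name)
  true
rescue ArgumentError
  false
end

p encoding_supported?("GB18030")      # ⇒ true
p encoding_supported?("no-such-enc")  # ⇒ false
p Encoding.find("utf-8")              # ⇒ #<Encoding:UTF-8>
```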
"ASCII-8BIT" "UTF-8" "US-ASCII" "Big5" "Big5-HKSCS" "Big5-UAO" "CP949" "Emacs-Mule" "EUC-JP" "EUC-KR" "EUC-TW" "GB18030" "GBK" "ISO-8859-1" "ISO-8859-2" "ISO-8859-3" "ISO-8859-4" "ISO-8859-5" "ISO-8859-6" "ISO-8859-7" "ISO-8859-8" "ISO-8859-9" "ISO-8859-10" "ISO-8859-11" "ISO-8859-13" "ISO-8859-14" "ISO-8859-15" "ISO-8859-16" "KOI8-R" "KOI8-U" "Shift_JIS" "UTF-16BE" "UTF-16LE" "UTF-32BE" "UTF-32LE" "Windows-1251" "IBM437" "IBM737" "IBM775" "CP850" "IBM852" "CP852" "IBM855" "CP855" "IBM857" "IBM860" "IBM861" "IBM862" "IBM863" "IBM864" "IBM865" "IBM866" "IBM869" "Windows-1258" "GB1988" "macCentEuro" "macCroatian" "macCyrillic" "macGreek" "macIceland" "macRoman" "macRomania" "macThai" "macTurkish" "macUkraine" "CP950" "CP951" "stateless-ISO-2022-JP" "eucJP-ms" "CP51932" "GB2312" "GB12345" "ISO-2022-JP" "ISO-2022-JP-2" "CP50220" "CP50221" "Windows-1252" "Windows-1250" "Windows-1256" "Windows-1253" "Windows-1255" "Windows-1254" "TIS-620" "Windows-874" "Windows-1257" "Windows-31J" "MacJapanese" "UTF-7" "UTF8-MAC" "UTF-16" "UTF-32" "UTF8-DoCoMo" "SJIS-DoCoMo" "UTF8-KDDI" "SJIS-KDDI" "ISO-2022-JP-KDDI" "stateless-ISO-2022-JP-KDDI" "UTF8-SoftBank" "SJIS-SoftBank"