Golang: String, Byte Slice, Rune Slice
Why You Need to Convert Between String, Byte Slice, Rune Slice
One annoying thing about golang is that you have to constantly convert between {string, byte slice, rune slice}.
They are the same thing in 3 different formats.
- String is an immutable byte sequence.
- Byte slice is a mutable byte sequence.
- Rune slice is the same text decoded into Unicode code points, so that each index is a character.
String is a nice way to deal with a short sequence of bytes or ASCII Characters. Every time you change a string, such as find and replace or concatenation, a new string is created. This is very inefficient if the string is huge, such as file content. 〔see Golang: String〕
Byte slice is just like string, but mutable. You can modify each byte in place. This is very efficient for working with file content, whether text file or binary file. 〔see Golang: Slice〕
Rune slice is like byte slice, except that each index is a character instead of a byte. This is best if you work with text that has lots of non-ASCII characters, such as Chinese, math symbols (π² ∞ ∫), or emoji (😄).
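To make the three views concrete, here is a small sketch. The helper names threeViews and replaceFirstByte are made up for this illustration. It shows that the byte count and the character count differ for non-ASCII text, and that editing a byte slice made from a string leaves the original string alone.

```go
package main

import "fmt"

// threeViews returns the length of s as a string (bytes),
// as a byte slice, and as a rune slice.
// (hypothetical helper, for illustration only)
func threeViews(s string) (strLen, byteLen, runeLen int) {
	return len(s), len([]byte(s)), len([]rune(s))
}

// replaceFirstByte returns a copy of s with its first byte set to b.
// The string s itself is never modified, because []byte(s) makes a copy.
// (hypothetical helper, for illustration only)
func replaceFirstByte(s string, b byte) string {
	if len(s) == 0 {
		return s
	}
	bs := []byte(s)
	bs[0] = b
	return string(bs)
}

func main() {
	s := "aπ😄"
	fmt.Println(threeViews(s))            // 7 7 3
	fmt.Println(replaceFirstByte(s, 'X')) // Xπ😄
	fmt.Println(s)                        // aπ😄
}
```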
Convert Between String, Byte Slice, Rune Slice
Here are common solutions for working with a golang string as a character sequence.
- []byte(str): String to byte slice.
- string(byteSlice): Byte slice to string.
- []rune(str): String to rune slice.
- string(runeSlice): Rune slice to string.
- []rune(string(byteSlice)): Byte slice to rune slice.
- []byte(string(runeSlice)): Rune slice to byte slice.
- utf8.RuneCount(byteSlice): Return the count of characters in byteSlice.
- utf8.RuneCountInString(str): Return the count of characters in str. (“character” here means rune.)
String To Byte Slice
[]byte(str)
package main

import "fmt"

func main() {
	var x = "abc→"

	// convert string to byte slice
	var bs = ([]byte)(x)
	fmt.Printf("%v\n", bs) // [97 98 99 226 134 146]
}
Byte Slice to String
string(byteSlice)
package main

import "fmt"

func main() {
	var bs = []byte{"a"[0], "b"[0], "c"[0], 0xE2, 0x86, 0x92}
	// 0xE2, 0x86, 0x92 is the utf8 encoding for →

	// convert byte slice to string
	var str = string(bs)
	fmt.Printf("%v\n", str) // abc→

	// print type
	fmt.Printf("%T\n", str) // string
}
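One thing worth knowing about string(byteSlice) is that it copies the bytes. The sketch below uses a made-up helper named snapshot: it converts a byte slice to a string, then changes the slice, and the string made earlier stays as it was.

```go
package main

import "fmt"

// snapshot converts bs to a string, then overwrites the first byte of bs.
// It returns the string made before the change and a string made after,
// to show that the earlier conversion was a copy.
// (hypothetical helper, for illustration only)
func snapshot(bs []byte) (before, after string) {
	before = string(bs)
	if len(bs) > 0 {
		bs[0] = 'X'
	}
	return before, string(bs)
}

func main() {
	before, after := snapshot([]byte("abc→"))
	fmt.Println(before) // abc→
	fmt.Println(after)  // Xbc→
}
```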
String To Rune Slice
[]rune(str)
package main

import "fmt"

func main() {
	var str = "abc→"

	// convert string to rune slice
	var rs = []rune(str)
	fmt.Printf("%v\n", rs) // [97 98 99 8594]

	// print type
	fmt.Printf("%T\n", rs) // []int32
}
Rune Slice To String
string(runeSlice)
package main

import "fmt"

func main() {
	var rs = []rune{'a', 'b', 'c', '→'}

	// convert rune slice to string
	var str = string(rs)
	fmt.Printf("%#v\n", str) // "abc→"

	// print type
	fmt.Printf("%T\n", str) // string
}
Byte Slice To Rune Slice
[]rune(string(byteSlice))
package main

import "fmt"

func main() {
	var bs = []byte{"a"[0], "b"[0], "c"[0], 0xE2, 0x86, 0x92}
	// 0xE2, 0x86, 0x92 is the utf8 encoding for →

	// convert byte slice to rune slice
	var rs = []rune(string(bs))

	for _, v := range rs {
		fmt.Printf("%c", v)
	} // abc→
	fmt.Printf("\n")

	// print type
	fmt.Printf("%T\n", rs) // []int32
}
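If you want to skip the intermediate string, one possible alternative is to decode the bytes yourself with utf8.DecodeRune. This is only a sketch, and bytesToRunes is a name made up here. Bytes that are not valid UTF-8 come out as U+FFFD, the Unicode replacement character.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// bytesToRunes decodes bs as UTF-8, one rune at a time.
// Each invalid byte is decoded as utf8.RuneError (U+FFFD).
// (hypothetical helper, for illustration only)
func bytesToRunes(bs []byte) []rune {
	rs := make([]rune, 0, utf8.RuneCount(bs))
	for len(bs) > 0 {
		r, size := utf8.DecodeRune(bs)
		rs = append(rs, r)
		bs = bs[size:] // size is at least 1 for a non-empty slice
	}
	return rs
}

func main() {
	rs := bytesToRunes([]byte{'a', 'b', 'c', 0xE2, 0x86, 0x92})
	fmt.Printf("%q\n", rs) // ['a' 'b' 'c' '→']

	// a byte that is not valid UTF-8
	fmt.Printf("%U\n", bytesToRunes([]byte{0xff})) // [U+FFFD]
}
```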
Rune Slice To Byte Slice
[]byte(string(runeSlice))
package main

import "fmt"

func main() {
	var rs = []rune{'a', 'b', 'c', '→'}

	// print type
	fmt.Printf("%T\n", rs) // []int32

	// convert rune slice to byte slice
	var bs = []byte(string(rs))

	fmt.Printf("%#v\n", bs) // []byte{0x61, 0x62, 0x63, 0xe2, 0x86, 0x92}
	fmt.Printf("%d\n", bs)  // [97 98 99 226 134 146]
	fmt.Printf("%q\n", bs)  // "abc→"
}
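Going the other way without the intermediate string is also possible, by encoding each rune with utf8.EncodeRune. Again this is just a sketch, and runesToBytes is a made-up name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesToBytes encodes each rune of rs as UTF-8 and appends the bytes.
// (hypothetical helper, for illustration only)
func runesToBytes(rs []rune) []byte {
	var buf [utf8.UTFMax]byte // scratch space, big enough for any rune
	out := make([]byte, 0, len(rs))
	for _, r := range rs {
		n := utf8.EncodeRune(buf[:], r) // n is the number of bytes written
		out = append(out, buf[:n]...)
	}
	return out
}

func main() {
	bs := runesToBytes([]rune{'a', 'b', 'c', '→'})
	fmt.Printf("%d\n", bs) // [97 98 99 226 134 146]
	fmt.Printf("%q\n", bs) // "abc→"
}
```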
Count of Characters
To count the number of characters, there are a few ways:
Use import "unicode/utf8"
- utf8.RuneCount(byteSlice): Return the count of characters in byteSlice.
- utf8.RuneCountInString(str): Return the count of characters in str. (“character” here means rune.)
Or convert it to rune slice, then call len, e.g. len([]rune("I ♥ U"))
package main

import "fmt"
import "unicode/utf8"

func main() {
	var x = "I ♥ U"

	// number of bytes
	fmt.Printf("%v\n", len(x)) // 7

	// number of characters
	fmt.Printf("%v\n", utf8.RuneCountInString(x)) // 5
}
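Since "character" here means rune, the count can differ from what you see on screen. A small sketch (sizes is a made-up helper): the letter é written as one code point counts as 1, but written as e plus a combining accent it counts as 2.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes returns the byte count and the rune count of s.
// (hypothetical helper, for illustration only)
func sizes(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	precomposed := "\u00e9" // é as one code point
	combined := "e\u0301"   // e followed by a combining acute accent

	fmt.Println(sizes(precomposed)) // 2 1
	fmt.Println(sizes(combined))    // 3 2
}
```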
Substring by Character Index
To get a substring with proper character boundaries, convert the string to a rune slice first. Like this:
package main

import "fmt"

func main() {
	// string of non-ascii chars
	var x = "♥😂→★🍎"

	// convert to rune slice
	var y = []rune(x)

	// take a slice from index 2 to 3
	var z = y[2:4]

	// print as chars
	fmt.Printf("%q\n", z) // ['→' '★']

	// print in go syntax
	fmt.Printf("%#v\n", z) // []int32{8594, 9733}
}
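If you do this often, it may be handy to wrap it in a function. Here is one possible sketch. The name substringByChar and the choice to clamp out-of-range indexes (instead of panicking) are assumptions made for this example.

```go
package main

import "fmt"

// substringByChar returns the characters of s from rune index start
// up to (not including) rune index end.
// Out-of-range indexes are clamped; an empty range gives "".
// (hypothetical helper, for illustration only)
func substringByChar(s string, start, end int) string {
	rs := []rune(s)
	if start < 0 {
		start = 0
	}
	if end > len(rs) {
		end = len(rs)
	}
	if start >= end {
		return ""
	}
	return string(rs[start:end])
}

func main() {
	x := "♥😂→★🍎"
	fmt.Println(substringByChar(x, 2, 4))  // →★
	fmt.Println(substringByChar(x, 3, 99)) // ★🍎
}
```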
Given Byte Index that Starts a Character, Find Its Char Index
Given a byte index that starts a character in a string (or byte slice), find the corresponding character (rune) index.
Solution:
utf8.RuneCount(byteSlice[0:index])
or
utf8.RuneCountInString(textStr[0:index])
package main

import "fmt"
import "unicode/utf8"

// chinese text (or any text containing non-ASCII)
var x = "中文和英文"

// 6 is the start of the char 和
var i = 6

// we want to show user the char position

func main() {
	fmt.Printf("position of 和 is: %v\n", utf8.RuneCountInString(x[0:i]))
	// position of 和 is: 2
}
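The count is only meaningful when the byte index really is the start of a character. A sketch of a helper that checks this first with utf8.RuneStart (the name charIndex and the ok return value are assumptions of this example):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// charIndex returns the rune index of the character that starts at
// byte index i of s. ok is false when i is out of range or does not
// point at the first byte of a character.
// (hypothetical helper, for illustration only)
func charIndex(s string, i int) (idx int, ok bool) {
	if i < 0 || i >= len(s) || !utf8.RuneStart(s[i]) {
		return 0, false
	}
	return utf8.RuneCountInString(s[:i]), true
}

func main() {
	s := "中文和英文"
	fmt.Println(charIndex(s, 6)) // 2 true
	fmt.Println(charIndex(s, 7)) // 0 false
}
```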
Given a Random Byte Index, Find the Index that Starts a Char
Given an arbitrary index into a byte slice, how do you find the byte index that starts the character containing that byte? (The byte slice may contain non-ASCII characters. 〔see ASCII Characters〕 )
Solution, first:
import "unicode/utf8"
then
for !utf8.RuneStart(textBytes[index]) { index-- }
Sample code:
package main

import "fmt"
import "unicode/utf8"

var x = "中文" // chinese
var i = 4

func main() {
	fmt.Printf("%q\n", x[i]) // '\u0096'
	// result is a byte inside a unicode byte sequence

	// set index to the index that begins a char
	for !utf8.RuneStart(x[i]) {
		i--
	}

	fmt.Printf("%v\n", i) // 3
	// the index that begins a unicode char is 3

	fmt.Printf("%q\n", x[i:len(x)]) // "文"
	// now you can extract the substring properly
}
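The bare loop assumes the index is in range and that a start byte exists before it. Here is a slightly more defensive sketch. The name charStart and the -1 return value for a bad index are choices made for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// charStart returns the byte index of the first byte of the character
// that contains byte index i. It returns -1 if i is out of range.
// (hypothetical helper, for illustration only)
func charStart(b []byte, i int) int {
	if i < 0 || i >= len(b) {
		return -1
	}
	// walk back until a start byte is found, but never past index 0
	for i > 0 && !utf8.RuneStart(b[i]) {
		i--
	}
	return i
}

func main() {
	b := []byte("中文")
	fmt.Println(charStart(b, 4)) // 3
	fmt.Println(charStart(b, 5)) // 3
	fmt.Println(charStart(b, 9)) // -1
}
```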
Iterate Characters in String
- for i, c := range str {body}: Go through the characters in the string. i is the index (with respect to bytes), c is the character.
package main

import "fmt"

func main() {
	const x = "abc♥ 😂d"
	for i, c := range x {
		fmt.Printf("%v %q\n", i, c)
	}
}

// 0 'a'
// 1 'b'
// 2 'c'
// 3 '♥'
// 6 ' '
// 7 '😂'
// 11 'd'
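Because the index from range is a byte index, a range loop is also an easy way to collect where each character begins. A small sketch, with charStarts being a made-up name:

```go
package main

import "fmt"

// charStarts returns the byte index at which each character of s begins.
// (hypothetical helper, for illustration only)
func charStarts(s string) []int {
	var starts []int
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	fmt.Println(charStarts("abc♥ 😂d")) // [0 1 2 3 6 7 11]
}
```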
If you don't need the index, do:
for _, c := range str {body}
package main

import "fmt"

func main() {
	const x = "♥ 😂"
	for _, c := range x {
		fmt.Printf("%q, %U\n", c, c)
	}
}

// '♥', U+2665
// ' ', U+0020
// '😂', U+1F602
Note: when you loop through a string by range, each character is decoded into a value of type rune (which is an alias for int32).
package main

import "fmt"

func main() {
	const x = "♥ 😂"
	for _, c := range x {
		// print the char and its type
		fmt.Printf("%q, %T\n", c, c)
	}
}

// '♥', int32
// ' ', int32
// '😂', int32
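By contrast, indexing a string with str[i] gives a single byte, not a rune. And when range meets a byte that is not valid UTF-8, it yields U+FFFD for that byte. The sketch below shows both. The helper decodeAll is made up for this example.

```go
package main

import "fmt"

// decodeAll collects the runes that a range loop produces for s.
// (hypothetical helper, for illustration only)
func decodeAll(s string) []rune {
	var rs []rune
	for _, c := range s {
		rs = append(rs, c)
	}
	return rs
}

func main() {
	s := "a\xffb" // \xff is not valid UTF-8

	// indexing gives a byte
	fmt.Printf("%T\n", s[0]) // uint8

	// range gives runes; the bad byte becomes U+FFFD
	fmt.Printf("%U\n", decodeAll(s)) // [U+0061 U+FFFD U+0062]
}
```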