Golang: String, Byte Slice, Rune Slice
Why You Need to Convert Between String, Byte Slice, Rune Slice
One annoying thing about golang is that you have to constantly convert between {string, byte slice, rune slice}.
They are the same thing in 3 different formats.
- A string is an immutable byte sequence.
- A byte slice is a mutable byte sequence.
- A rune slice is a re-grouping of the byte sequence so that each index holds one character (a Unicode code point).
A string is a nice way to deal with a short sequence of bytes or ASCII characters. Every time you modify a string, such as find and replace or concatenation, a new string is created. This is very inefficient if the string is huge, such as file content. [see Golang: String]
A byte slice is just like a string, but mutable. You can modify each byte or character. This is very efficient for working with file content, whether a text file or a binary file. [see Golang: Slice]
A rune slice is like a byte slice, except that each index is a character instead of a byte. This is best if you work with text that has lots of non-ASCII characters, such as Chinese, math symbols (∑ ∫ π² ∞), or emoji (😄).
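Here is a small sketch of the difference between the three forms. The helper name replaceFirstByte is made up for illustration. It edits a byte slice copy, because assigning to an index of a string does not compile.

```go
package main

import "fmt"

// replaceFirstByte (hypothetical helper) returns a copy of s
// with its first byte replaced by b.
func replaceFirstByte(s string, b byte) string {
	bs := []byte(s) // copies the bytes of s into a mutable slice
	if len(bs) > 0 {
		bs[0] = b
	}
	return string(bs) // copies the bytes into a new string
}

func main() {
	s := "abc→"

	fmt.Println(len(s))         // 6, length in bytes (→ takes 3 bytes)
	fmt.Println(len([]byte(s))) // 6
	fmt.Println(len([]rune(s))) // 4, length in characters

	// s[0] = 'A' would not compile, because strings are immutable
	fmt.Println(replaceFirstByte(s, 'A')) // Abc→
}
```

Note that each conversion copies the data, so for big text it is usually better to pick one form and stay in it.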
Convert Between String, Byte Slice, Rune Slice
Here are common solutions for working with a golang string as a character sequence.
[]byte(str)
- String to byte slice.
string(byteSlice)
- Byte slice to string.
[]rune(str)
- String to rune slice.
string(runeSlice)
- Rune slice to string.
[]rune(string(byteSlice))
- Byte slice to rune slice.
[]byte(string(runeSlice))
- Rune slice to byte slice.
utf8.RuneCount(byteSlice)
- Return the count of characters in byteSlice.
utf8.RuneCountInString(str)
- Return the count of characters in string. (“character” here means rune.)
String To Byte Slice
[]byte(str)
package main

import "fmt"

func main() {
	var x = "abc→"

	// convert string to byte slice
	var bs = []byte(x)

	fmt.Printf("%v\n", bs) // [97 98 99 226 134 146]
}
Byte Slice to String
string(byteSlice)
package main

import "fmt"

func main() {
	var bs = []byte{"a"[0], "b"[0], "c"[0], 0xE2, 0x86, 0x92}
	// 0xE2, 0x86, 0x92 is the utf8 encoding for →

	// convert byte slice to string
	var str = string(bs)

	fmt.Printf("%v\n", str) // abc→

	// print type
	fmt.Printf("%T\n", str) // string
}
String To Rune Slice
[]rune(str)
package main

import "fmt"

func main() {
	var str = "abc→"

	// convert string to rune slice
	var rs = []rune(str)

	fmt.Printf("%v\n", rs) // [97 98 99 8594]

	// print type
	fmt.Printf("%T\n", rs) // []int32
}
Rune Slice To String
string(runeSlice)
package main

import "fmt"

func main() {
	var rs = []rune{'a', 'b', 'c', '→'}

	// convert rune slice to string
	var str = string(rs)

	fmt.Printf("%#v\n", str) // "abc→"

	// print type
	fmt.Printf("%T\n", str) // string
}
Byte Slice To Rune Slice
[]rune(string(byteSlice))
package main

import "fmt"

func main() {
	var bs = []byte{"a"[0], "b"[0], "c"[0], 0xE2, 0x86, 0x92}
	// 0xE2, 0x86, 0x92 is the utf8 encoding for →

	// convert byte slice to rune slice
	var rs = []rune(string(bs))

	for _, v := range rs {
		fmt.Printf("%c", v)
	}
	// abc→

	fmt.Printf("\n")

	// print type
	fmt.Printf("%T\n", rs) // []int32
}
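As an alternative sketch, the standard library function bytes.Runes does the same conversion in one call. The wrapper name toRunes is made up for illustration. The last line shows what happens when the bytes are not valid UTF-8.

```go
package main

import (
	"bytes"
	"fmt"
)

// toRunes (hypothetical wrapper) converts UTF-8 encoded bytes to a rune slice.
func toRunes(bs []byte) []rune {
	return bytes.Runes(bs)
}

func main() {
	bs := []byte("abc→")
	rs := toRunes(bs)

	fmt.Println(len(bs), len(rs)) // 6 4
	fmt.Printf("%q\n", rs)        // ['a' 'b' 'c' '→']

	// an invalid byte decodes to U+FFFD, the Unicode replacement character
	fmt.Printf("%U\n", toRunes([]byte{0xff})) // [U+FFFD]
}
```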
Rune Slice To Byte Slice
[]byte(string(runeSlice))
package main

import "fmt"

func main() {
	var rs = []rune{'a', 'b', 'c', '→'}

	// print type
	fmt.Printf("%T\n", rs) // []int32

	// convert rune slice to byte slice
	var bs = []byte(string(rs))

	fmt.Printf("%#v\n", bs) // []byte{0x61, 0x62, 0x63, 0xe2, 0x86, 0x92}

	fmt.Printf("%d\n", bs) // [97 98 99 226 134 146]

	fmt.Printf("%q\n", bs) // "abc→"
}
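If you want to see what the conversion does under the hood, here is a rough sketch that encodes each rune by hand with utf8.EncodeRune. The function name runesToBytes is made up for illustration, and this is not necessarily how the compiler implements the conversion.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesToBytes (hypothetical helper) encodes each rune as UTF-8
// and appends the resulting bytes to a byte slice.
func runesToBytes(rs []rune) []byte {
	bs := make([]byte, 0, len(rs))   // every rune needs at least 1 byte
	buf := make([]byte, utf8.UTFMax) // scratch space, big enough for any rune
	for _, r := range rs {
		n := utf8.EncodeRune(buf, r) // n is the number of bytes written
		bs = append(bs, buf[:n]...)
	}
	return bs
}

func main() {
	rs := []rune{'a', 'b', 'c', '→'}
	fmt.Println(runesToBytes(rs)) // [97 98 99 226 134 146]
}
```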
Count of Characters
To count the number of characters, there are a few ways:
Use import "unicode/utf8"
utf8.RuneCount(byteSlice)
- Return the count of characters in byteSlice.
utf8.RuneCountInString(str)
- Return the count of characters in string. (“character” here means rune.)
Or convert it to a rune slice, then call len, for example: len([]rune("I ♥ U"))
package main

import "fmt"
import "unicode/utf8"

func main() {
	var x = "I ♥ U"

	// number of bytes
	fmt.Printf("%v\n", len(x)) // 7

	// number of characters
	fmt.Printf("%v\n", utf8.RuneCountInString(x)) // 5
}
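A quick sketch that runs all three ways side by side. The function name runeCounts is made up for illustration. The second call is a reminder that a rune is a code point, which is not always what a reader sees as one character.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeCounts (hypothetical helper) returns the rune count of s
// computed in three different ways.
func runeCounts(s string) (int, int, int) {
	return utf8.RuneCountInString(s), utf8.RuneCount([]byte(s)), len([]rune(s))
}

func main() {
	fmt.Println(runeCounts("I ♥ U")) // 5 5 5

	// "e" followed by U+0301 (combining acute accent) displays as
	// one character but is two runes
	fmt.Println(runeCounts("e\u0301")) // 2 2 2
}
```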
Substring by Character Index
To get a substring with proper character boundaries, convert it to rune slice first. Like this:
package main

import "fmt"

func main() {
	// string of non-ascii chars
	var x = "♥😂→★🍎"

	// convert to rune slice
	var y = []rune(x)

	// take a slice from index 2 to 3
	var z = y[2:4]

	// print as chars
	fmt.Printf("%q\n", z) // ['→' '★']

	// print in go syntax
	fmt.Printf("%#v\n", z) // []int32{8594, 9733}
}
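The same idea can be wrapped in a small function that returns a string. This is just a sketch. The name substr and the choice to clamp out-of-range indexes (instead of panicking) are assumptions for illustration.

```go
package main

import "fmt"

// substr (hypothetical helper) returns the characters of s from
// character index start up to, but not including, end.
// Out-of-range indexes are clamped to the valid range.
func substr(s string, start, end int) string {
	rs := []rune(s)
	if start < 0 {
		start = 0
	}
	if end > len(rs) {
		end = len(rs)
	}
	if start >= end {
		return ""
	}
	return string(rs[start:end])
}

func main() {
	fmt.Println(substr("♥😂→★🍎", 2, 4))  // →★
	fmt.Println(substr("♥😂→★🍎", 3, 99)) // ★🍎
}
```

This converts the whole string to a rune slice on every call, so it is fine for short strings but wasteful in a loop over big text.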
Given Byte Index that Start a Character, Find Its Char Index
Given an index (of a char start byte) of a string (or byte slice), find the corresponding rune (char start) index.
Solution:
utf8.RuneCount(byteSlice[0:index])
or
utf8.RuneCountInString(textStr[0:index])
package main

import "fmt"
import "unicode/utf8"

// chinese text (or any text containing non-ASCII)
var x = "中文和英文"

// 6 is the start of the char 和
var i = 6

// we want to show user the char position

func main() {
	fmt.Printf("position of 和 is: %v\n", utf8.RuneCountInString(x[0:i]))
	// position of 和 is: 2
}
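A typical place where you get such a byte index is strings.Index, which reports positions in bytes. Here is a sketch that combines the two steps. The function name charIndex is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// charIndex (hypothetical helper) returns the character index of the
// first occurrence of sub in s, or -1 if sub is not found.
func charIndex(s, sub string) int {
	i := strings.Index(s, sub) // byte index, or -1
	if i < 0 {
		return -1
	}
	// count the runes in front of the match
	return utf8.RuneCountInString(s[:i])
}

func main() {
	x := "中文和英文"
	fmt.Println(strings.Index(x, "和")) // 6
	fmt.Println(charIndex(x, "和"))     // 2
	fmt.Println(charIndex(x, "z"))     // -1
}
```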
Given a Random Byte Index, Find the Index that Start a Char
Given an arbitrary index into a byte slice, how do you find the byte index that starts the character containing that byte? (The byte slice may contain non-ASCII characters. [see ASCII Characters] )
Solution:
import "unicode/utf8"
then
for !utf8.RuneStart(textBytes[index]) { index-- }
Sample code:
package main

import "fmt"
import "unicode/utf8"

var x = "中文" // chinese
var i = 4

func main() {
	fmt.Printf("%q\n", x[i]) // '\u0096'
	// result is a byte inside unicode byte sequence

	// set index to the index that begins a char
	for !utf8.RuneStart(x[i]) {
		i--
	}

	fmt.Printf("%v\n", i) // 3
	// the index that begins a unicode char is 3

	fmt.Printf("%q\n", x[i:len(x)]) // "文"
	// now you can extract the substring properly
}
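The loop above assumes the index is in range and that a start byte exists somewhere before it. Here is a slightly more defensive sketch. The function name charStart and the -1 return value are assumptions for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// charStart (hypothetical helper) walks back from byte index i until it
// reaches a byte that can start a character, and returns that index.
// It returns -1 if i is out of range. It never goes below index 0,
// even if the data is not valid UTF-8.
func charStart(bs []byte, i int) int {
	if i < 0 || i >= len(bs) {
		return -1
	}
	for i > 0 && !utf8.RuneStart(bs[i]) {
		i--
	}
	return i
}

func main() {
	bs := []byte("中文")

	fmt.Println(charStart(bs, 4)) // 3
	fmt.Println(charStart(bs, 2)) // 0
	fmt.Println(charStart(bs, 9)) // -1, index out of range

	// decode the character that starts at the found index
	r, size := utf8.DecodeRune(bs[charStart(bs, 4):])
	fmt.Printf("%c %d\n", r, size) // 文 3
}
```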
Loop Thru Characters in String
for i, c := range str {…}
- Go thru the characters in a string. i is the index (with respect to bytes), c is the character.
package main

import "fmt"

func main() {
	const x = "abc♥ 😂d"
	for i, c := range x {
		fmt.Printf("%v %q\n", i, c)
	}
}

// 0 'a'
// 1 'b'
// 2 'c'
// 3 '♥'
// 6 ' '
// 7 '😂'
// 11 'd'
If you don't need the index, do:
for _, c := range str {…}
package main

import "fmt"

func main() {
	const x = "♥ 😂"
	for _, c := range x {
		fmt.Printf("%q, %U\n", c, c)
	}
}

// '♥', U+2665
// ' ', U+0020
// '😂', U+1F602
Note: when you loop thru a string by range, each character in the string is turned into a “rune” type, which is golang's term for a Unicode codepoint. That is, an integer ID for the character.
package main

import "fmt"

func main() {
	const x = "♥ 😂"
	for _, c := range x {
		// print the char and its type
		fmt.Printf("%q, %T\n", c, c)
	}
}

// '♥', int32
// ' ', int32
// '😂', int32
[see Golang: Rune]
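If you want the index to count characters instead of bytes, one option is to range over a rune slice. Here is a sketch. The function name charIndexes is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// charIndexes (hypothetical helper) returns "index:char" pairs for s,
// where index counts characters, not bytes.
func charIndexes(s string) string {
	var parts []string
	// ranging over a rune slice gives consecutive indexes 0, 1, 2, ...
	for i, c := range []rune(s) {
		parts = append(parts, fmt.Sprintf("%d:%c", i, c))
	}
	return strings.Join(parts, " ")
}

func main() {
	fmt.Println(charIndexes("a♥😂d")) // 0:a 1:♥ 2:😂 3:d
}
```

Compare with ranging over the string directly, where the indexes for the same text would be the byte positions 0, 1, 4, 8.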