Xah Talk Show 2026-01-12 Ep743 Unicode, code point, programming language design of string, bytes vs code unit vs char
Video Summary (Generated by AI, Edited by Human.)
- This video discusses Unicode, character encoding, and string handling in various programming languages (0:07).
- Before diving into the main topic, the speaker briefly introduces and reviews:
  - Ergonomic keyboards (0:52-2:32), including the Ultimate Hacking Keyboard and Kinesis models.
  - Trackballs and mice (2:47-5:25), highlighting the Nulea M505 and discussing "China cheapo" peripherals.
- The core of the video focuses on Unicode and its implications for programming:
  - Unicode Basics (8:00-8:17, 10:40-11:04): Unicode is a standard that includes characters, symbols, and writing systems from all human languages, past and present. Each character has a unique code point, which is an integer ID (38:11).
  - Character Encoding (38:46-39:40): This is the process of converting a character's code point into a sequence of bytes. The video specifically mentions:
    - UTF-8 encoding: A widely used variable-length encoding that uses one to four bytes per code point (16:33, 38:39).
    - UTF-16 encoding: Each character is encoded as one or two 16-bit code units, that is, two or four bytes, depending on its code point (38:46, 46:12-46:51).
  - String Handling in Programming Languages: The speaker highlights significant differences in how programming languages handle strings, particularly concerning Unicode characters.
    - JavaScript (35:05-55:31): JavaScript strings are sequences of 16-bit UTF-16 "code units" (two bytes each). Characters that need two code units (like many emojis or some Chinese characters) are therefore counted, indexed, and split as two separate elements, so string operations can behave incorrectly or return invalid partial characters (35:41-54:02). The speaker considers JavaScript one of the "worst programming languages" in this regard, noting that existing string methods are "fatally flawed" for Unicode characters (54:28-54:50).
    - Go (Golang) (14:46-16:57, 55:32-56:20): In contrast, Go defines its strings as sequences of bytes, conventionally holding UTF-8 encoded text. Go also has a character type called "rune" (16:09, 16:47-16:51), an integer type whose value is a Unicode code point. The speaker praises Go for its robust handling of Unicode characters, asserting that all string operations work correctly even with complex Unicode strings (55:32-55:56).
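
The relationship between a code point and its encoded size, as described in the summary above, can be sketched in Go. This is a minimal illustration, not code from the video: the helper name `encodedSizes` and the three sample characters are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// encodedSizes (a hypothetical helper) reports how many UTF-8 bytes and
// how many UTF-16 code units are needed to encode the single code point r.
func encodedSizes(r rune) (utf8Bytes, utf16Units int) {
	utf8Bytes = utf8.RuneLen(r)
	utf16Units = len(utf16.Encode([]rune{r}))
	return utf8Bytes, utf16Units
}

func main() {
	for _, r := range []rune{'a', '中', '🦋'} {
		b, u := encodedSizes(r)
		// Each UTF-16 code unit is 16 bits, so u code units occupy u*2 bytes.
		fmt.Printf("%c U+%04X: UTF-8 %d bytes, UTF-16 %d code units (%d bytes)\n",
			r, r, b, u, u*2)
	}
}
```

The same integer ID yields different byte counts under each encoding, which is why a language's choice of string unit (byte, code unit, or code point) matters.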
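// split("") cuts a string at every UTF-16 code unit.
console.log("abc".split("")); // three elements: a, b, c
console.log("a🦋c".split("")); // four elements: the butterfly becomes two lone surrogates, U+D83E and U+DD8B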
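package main

import "fmt"

func main() {
	// len returns the number of bytes, not characters.
	// The butterfly takes 4 bytes in UTF-8, so this prints 4.
	fmt.Printf("%v\n", len("🦋"))
}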
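
Since `len` counts bytes, counting characters needs a different call. A small sketch, assuming the goal is the number of code points: `utf8.RuneCountInString` from the standard library does this. The helper name `counts` is made up for the example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts (a hypothetical helper) returns the byte length and the
// code point count of s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := counts("a🦋c")
	fmt.Println(b, r) // 6 3
}
```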
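package main

import "fmt"

func main() {
	// Convert the string to a slice of runes (code points).
	var xslice = []rune("abc🦋")
	// Prints [97 98 99 129419]: one integer per character.
	fmt.Printf("%v\n", xslice)
}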
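
Converting to `[]rune` is one way to get at characters; another is a `for ... range` loop, which decodes one code point per step and reports the byte offset where it starts. The sketch below is an illustration under that assumption, with `runeOffsets` as a made-up helper name. Note that indexing or slicing a Go string directly still works on bytes.

```go
package main

import "fmt"

// runeOffsets (a hypothetical helper) returns the byte offset at which
// each code point of s starts.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "a🦋c"
	// range decodes UTF-8: r is a rune, i is its starting byte offset.
	for i, r := range s {
		fmt.Printf("byte offset %d: %c (U+%04X)\n", i, r, r)
	}
	fmt.Println(runeOffsets(s)) // [0 1 5]

	// Slicing uses byte positions; these are the butterfly's four UTF-8 bytes.
	fmt.Printf("% x\n", s[1:5]) // f0 9f a6 8b
}
```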