JS: String Code Unit
What is a Code Unit?
A JavaScript string is a sequence of 16-bit values (called code units) that represent characters in the UTF-16 encoding.
Each “element” in a string is technically not a “character”, but a “code unit”.
If the string contains only ASCII characters, each code unit corresponds to one character.
If the string contains a character whose codepoint is ≥ 2^16 (e.g. 😂), that character takes 2 code units and thus occupies 2 indexes. The result of string functions may then be unexpected.
Example: Difference Between Character and Code Unit
console.log("😂".length === 2);
// true

// 😂
// name: FACE WITH TEARS OF JOY
// codepoint in decimal: 128514
Here is an example where the String.prototype.slice method gives an unexpected result.
// we want to take the substring abc
console.log(("😂abc".slice(1) === "abc") === false);
// true
// wrong

// 😂
// name: FACE WITH TEARS OF JOY
// codepoint in decimal: 128514
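To see why the slice result is not "abc", it may help to look at what is left after the cut. The following is a small sketch. The variable name sliced is only for illustration.

```javascript
// slice(1) cuts after the first code unit, which is only half of 😂
const sliced = "😂abc".slice(1);

// 4 code units remain: the second half of 😂, then a, b, c
console.log(sliced.length); // 4

// the leftover first element is the second code unit of 😂, in hex
console.log(sliced.charCodeAt(0).toString(16)); // de02

// cutting at index 2 skips both code units of 😂
console.log("😂abc".slice(2) === "abc"); // true
```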
Code Unit Explained
Here is a more detailed explanation.
- JavaScript strings and characters are based on the Unicode standard, version 5.1 or later.
- In Unicode, each character has an integer ID, called its codepoint.
- Unicode specifies several encoding standards. The most popular are UTF-8 and UTF-16.
- An encoding is a standard that translates a character into a sequence of bytes. 〔see Unicode: Character Set, Encoding, UTF-8, Codepoint〕
- The UTF-16 encoding converts each character into 2 or 4 bytes, depending on the character. (Each 2 bytes is considered a unit, called a code unit.)
- For a character whose codepoint is less than 2^16, its UTF-16 encoding is 2 bytes. Otherwise, it is 4 bytes (2 code units, called a surrogate pair).
- JavaScript defines the elements of a string as the sequence of 2-byte values of its characters encoded in UTF-16. That is, first encode the characters of the string by UTF-16, which gives 2 or 4 bytes per character. Then group every 2 bytes as a code unit. Index 0 is the first 2-byte unit, index 1 is the second 2-byte unit, etc. This means that when a string contains a character whose codepoint is ≥ 2^16, the result of index-based string methods may be unexpected, because an index does not correspond to a character.
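The grouping into code units described above can be inspected directly. Here is a minimal sketch, using the standard methods String.prototype.charCodeAt (code unit at an index) and String.prototype.codePointAt (codepoint starting at an index). The helper name codeUnitsOf is made up for this example.

```javascript
// codeUnitsOf is a hypothetical helper, not a built-in.
// It returns the 16-bit code units of a string, each as a hexadecimal string.
function codeUnitsOf(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16));
  }
  return units;
}

// codepoint less than 2^16: 1 code unit
console.log(codeUnitsOf("a")); // [ '61' ]

// codepoint ≥ 2^16: 2 code units
console.log(codeUnitsOf("😂")); // [ 'd83d', 'de02' ]

// codePointAt reads both code units and returns the codepoint
console.log("😂".codePointAt(0)); // 128514
```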
What Characters Create Problems?
Characters whose codepoint is ≥ 2^16 are emoji and other less frequently used characters, e.g. rarely used Chinese characters and characters of ancient languages.
〔see Unicode: Emoji 😄〕
Characters whose codepoint is less than 2^16 are those whose codepoint in hexadecimal can be expressed with 4 digits or fewer.
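A quick way to check a given character is to compare its codepoint against 2^16. The sketch below assumes the character of interest is the first one in the string. The function name isTwoCodeUnits is made up for this example.

```javascript
// isTwoCodeUnits is a hypothetical helper, not a built-in.
// It returns true if the first character of str has codepoint ≥ 2^16.
function isTwoCodeUnits(str) {
  return str.codePointAt(0) >= 2 ** 16;
}

console.log(isTwoCodeUnits("a")); // false
console.log(isTwoCodeUnits("中")); // false
console.log(isTwoCodeUnits("😂")); // true

// the codepoint of 😂 in hexadecimal needs 5 digits
console.log("😂".codePointAt(0).toString(16)); // 1f602
```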
How to Go Thru Characters (Not Code Units)
If you have a string that may contain a character with codepoint ≥ 2^16, the result of index-based string methods may be unexpected.
The solution is to use a for-of loop to go thru the string. It goes thru the string by character (codepoint), not by code unit.
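Here is a minimal sketch of going thru a string by character, applied to the earlier "😂abc" example. The variable names are only for illustration.

```javascript
const str = "😂abc";

// collect each character; for-of keeps the 2 code units of 😂 together
const chars = [];
for (const ch of str) {
  chars.push(ch);
}

console.log(chars); // [ '😂', 'a', 'b', 'c' ]

// drop the first character, not the first code unit
console.log(chars.slice(1).join("") === "abc"); // true
```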
Real Length: Number of Characters in String
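One possible way to count characters instead of code units is to turn the string into an array of characters first. Array.from goes thru a string the same way a for-of loop does. The function name charCount is made up for this example. Note that this counts codepoints, so a visible symbol built from several codepoints counts more than once.

```javascript
// charCount is a hypothetical helper, not a built-in.
// It returns the number of characters (codepoints) in str.
function charCount(str) {
  return Array.from(str).length;
}

// length counts code units
console.log("😂abc".length); // 5

// charCount counts characters
console.log(charCount("😂abc")); // 4
```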