JS: String Index Code Unit
What is a Code Unit?
A code unit is a 2-byte (16-bit) unit of a Unicode character's UTF-16 encoding.
- In UTF-16 encoding, each Unicode character is encoded as either 2 or 4 bytes.
- If a character's Char ID (code point) fits in 16 bits (that is, it is less than 2^16), its UTF-16 encoding is 2 bytes. Otherwise, it is 4 bytes.
- A JavaScript string is a sequence of code units that represent Unicode characters. A single code unit may be only half of a character.
- If the string contains a character whose Char ID is ≥ 2^16 (e.g. 🦋 (U+1F98B: BUTTERFLY), most emoji, rare Chinese characters, or rare language symbols), that character is 2 code units and thus occupies 2 indexes. The result of string functions may be unexpected.
- Characters whose Char ID does not fit in 16 bits are outside of the Unicode Basic Multilingual Plane.
Example: Difference of Character and Code Unit
console.log("🦋".length); // 2
Here is an example where the String.prototype.slice method gives an unexpected result.
// we want to take the substring abc
console.log("🦋abc".slice(1));
// �abc
// result is not what we want
Code Unit Explained
Here is a more detailed explanation.
- JavaScript strings and characters are based on the Unicode standard, version 5.1 or later.
- In Unicode, each character has an integer ID, called its code point.
- Unicode specifies several encodings. The most popular are UTF-8 and UTF-16.
- An encoding is a standard that translates a character into a sequence of bytes. 〔see Unicode: Character Set, Encoding, UTF-8, Code Point〕
- UTF-16 encoding converts each character into 2 or 4 bytes, depending on the character. (Each 2 bytes is considered a unit, called a code unit.)
- For a character whose code point is less than 2^16, its UTF-16 encoding is 2 bytes. Otherwise, it is 4 bytes.
- JavaScript defines a string as a sequence of 2-byte values: the UTF-16 code units of its characters. That is, first encode each character in the string by UTF-16, which gives 2 or 4 bytes per character. Then, group every 2 bytes as a code unit. Index 0 is the first 2-byte unit, index 1 is the second 2-byte unit, and so on. This means that when a string contains a character whose code point is ≥ 2^16, the result of string properties and methods that work by index may be unexpected, because an index does not correspond to a character.
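The grouping into code units described above can be seen directly. The following is a minimal sketch; the helper name codeUnits is made up for illustration and is not a built-in.

```javascript
// Hypothetical helper (not a built-in): list the UTF-16 code units
// of a string, each as a 4-digit uppercase hexadecimal string.
function codeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    // charCodeAt returns the 16-bit code unit at index i
    units.push(str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
  }
  return units;
}

// code point less than 2^16: one code unit
console.log(codeUnits("a")); // [ '0061' ]

// code point U+1F98B is ≥ 2^16: two code units
console.log(codeUnits("🦋")); // [ 'D83E', 'DD8B' ]

// codePointAt reads the whole character starting at index 0
console.log("🦋".codePointAt(0).toString(16)); // 1f98b
```

Each of the two code units of 🦋 is not a character on its own, which is why index 1 of the string points into the middle of the character.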
What Characters Create Problems?
Characters outside of the Unicode Basic Multilingual Plane, that is, characters whose code point is ≥ 2^16. Typically these are emoji.
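One way to check whether a string contains such characters is to look at each character's code point. This is a sketch only; the function name hasNonBmpChar is an assumption made for this example.

```javascript
// Hypothetical helper (not a built-in): return true if the string
// contains any character outside the Basic Multilingual Plane.
function hasNonBmpChar(str) {
  // for-of goes through the string by character, not by code unit
  for (const ch of str) {
    // 0xFFFF is the largest code point that fits in 16 bits
    if (ch.codePointAt(0) > 0xFFFF) {
      return true;
    }
  }
  return false;
}

console.log(hasNonBmpChar("abc")); // false
console.log(hasNonBmpChar("🦋abc")); // true
```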
How to Go Thru Character (Not Code Unit)
If you have a string that contains characters outside of the Unicode Basic Multilingual Plane, the result of string methods that work by index may be unexpected.
The solution is to use a for-of loop or Array.from to go thru the string, because both go by character, not by code unit.
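Here is a sketch of both approaches. The function names toChars and sliceChars are made up for this example; only for-of and Array.from are built in.

```javascript
// Hypothetical helper: collect the characters of a string using for-of.
function toChars(str) {
  const chars = [];
  for (const ch of str) {
    // each ch is one whole character, even if it is 2 code units
    chars.push(ch);
  }
  return chars;
}

// Hypothetical helper: like String.prototype.slice, but the
// indexes count characters instead of code units.
function sliceChars(str, start, end) {
  return Array.from(str).slice(start, end).join("");
}

console.log(toChars("🦋abc")); // [ '🦋', 'a', 'b', 'c' ]

// we want to take the substring abc
console.log(sliceChars("🦋abc", 1)); // abc
```

Array.from creates a new array on each call, so for very long strings it may be better to convert once and reuse the array.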
Real Length: Number of Characters in String
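The length property counts code units. To count characters, one approach is to convert the string to an array of characters first. A minimal sketch, with charCount as a made-up name:

```javascript
// Hypothetical helper (not a built-in): number of characters
// (code points) in a string, not number of code units.
function charCount(str) {
  return Array.from(str).length;
}

console.log("🦋abc".length); // 5
console.log(charCount("🦋abc")); // 4
```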