What is a Code Unit?
Each “element” in a string is technically not a “character”, but a “code unit”.
If the string contains only ASCII characters, each “code unit” is a character.
If the string contains a character whose codepoint is ≥ 2^16 (e.g. 😂), that character is 2 code units, and thus occupies 2 index positions. Results of string functions may then be unexpected.
Here is an example showing the difference between character and code unit:
console.log("😂".length === 2); // true

// 😂
// name: FACE WITH TEARS OF JOY
// codepoint in decimal: 128514
Here is an example of the String.prototype.slice method giving an unexpected result.
// we want to take the substring abc
console.log(("😂abc".slice(1) === "abc") === false); // true
// wrong

// 😂
// name: FACE WITH TEARS OF JOY
// codepoint in decimal: 128514
Code Unit Explained
Here is a more detailed explanation.
- Each character has an integer ID, called Codepoint.
- Unicode specifies several encoding standards. The most popular ones are UTF-8 and UTF-16.
- An encoding is a standard that translates a character into a sequence of bytes. [see Unicode Basics: Character Set, Encoding, UTF-8, Codepoint]
- The UTF-16 encoding converts each character into 2 or 4 bytes, depending on the character. Each 2 bytes is considered a unit, called a code unit.
- For characters whose codepoint is less than 2^16, the encoding of that character in UTF-16 is 2 bytes (1 code unit). Otherwise, it's 4 bytes (2 code units). JavaScript strings are sequences of UTF-16 code units, which is why string length and indexes count code units.
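The 2-or-4-bytes rule above can be checked by computing the code units by hand. The following is a minimal sketch, not a full encoder: the helper name toUtf16CodeUnits is made up for illustration, and it assumes the input is a valid codepoint outside the surrogate range.

```javascript
// Hypothetical helper (not a built-in): return the UTF-16 code units of one codepoint.
// Assumes the argument is a valid codepoint and not itself a surrogate.
const toUtf16CodeUnits = (codepoint) => {
  if (codepoint < 0x10000) {
    // codepoint < 2^16: one code unit (2 bytes)
    return [codepoint];
  }
  // codepoint ≥ 2^16: two code units (4 bytes), called a surrogate pair
  const offset = codepoint - 0x10000;
  const high = 0xd800 + (offset >> 10); // top 10 bits of the offset
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits of the offset
  return [high, low];
};

const units = toUtf16CodeUnits(128514); // 😂
console.log(units); // [ 55357, 56834 ]

// the built-in methods report the same values
console.log("😂".charCodeAt(0) === units[0]); // true
console.log("😂".charCodeAt(1) === units[1]); // true
console.log("😂".codePointAt(0) === 128514); // true
```

charCodeAt returns a code unit at an index, while codePointAt returns the whole codepoint when the index is at the start of a 2-code-unit character.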
What Characters Create Problems?
Characters whose codepoint ≥ 2^16 are emoji and other less frequently used characters, e.g., rarely used Chinese characters, ancient language characters.
[see Unicode Emoji 😄]
Characters whose codepoint is less than 2^16 are those whose codepoint in hexadecimal can be expressed with 4 digits or fewer.
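To find out whether a given string contains such characters, one can test each codepoint against 2^16. This is a small sketch; the function name hasCharBeyond16Bits is made up for illustration.

```javascript
// Hypothetical helper: true if the string contains a character with codepoint ≥ 2^16
const hasCharBeyond16Bits = (str) => {
  for (const char of str) {
    if (char.codePointAt(0) >= 2 ** 16) return true;
  }
  return false;
};

console.log(hasCharBeyond16Bits("abc")); // false
console.log(hasCharBeyond16Bits("中文")); // false
console.log(hasCharBeyond16Bits("😂abc")); // true

// codepoint in hexadecimal
console.log("中".codePointAt(0).toString(16)); // 4e2d (4 digits)
console.log((128514).toString(16)); // 1f602 (5 digits)
```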
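How to Go Thru Characters (Not Code Units)
If you have a string that may contain any character with codepoint ≥ 2^16, the result of any string method that works by index or length (e.g. length, slice, charAt) may be unexpected.
The solution is to use a for-of loop to go thru the string. It goes thru the string by character, not by code unit.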
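Here is a short sketch of the for-of approach, reusing the "😂abc" string from the slice example. The variable names are arbitrary.

```javascript
const str = "😂abc";

// for-of visits each character, not each code unit
const chars = [];
for (const char of str) {
  chars.push(char);
}
console.log(chars); // [ '😂', 'a', 'b', 'c' ]

// by contrast, a numeric index picks out a code unit
console.log(str[0] === "😂"); // false (str[0] is only half of the character)

// take the substring abc, by character
console.log(chars.slice(1).join("") === "abc"); // true
```

Collecting the characters into an array first makes the array methods (slice, reverse, etc.) work on whole characters.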
Real Length: Number of Characters in String
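One possible way to get the number of characters is to spread the string into an array, since the spread operator goes thru the string the same way a for-of loop does. This is a sketch, not the only approach.

```javascript
const str = "😂abc";

// number of code units
console.log(str.length); // 5

// number of characters
console.log([...str].length); // 4

// same result with Array.from
console.log(Array.from(str).length); // 4
```

This counts codepoints. A character made of several codepoints (e.g. a letter followed by a combining accent mark) is still counted as more than 1.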