JavaScript: String Code Unit
JavaScript does not have a character datatype. A string is used as a sequence of characters. However, a JavaScript string is technically a sequence of 16-bit units, not characters. This page explains the details of working with a string as a sequence of characters.
What is Code Unit?
A JavaScript string is a sequence of 16-bit values (called “code units”) that represent characters in the UTF-16 encoding.
Each “element” in a string is technically not a “character”, but a “code unit”.
If the string contains ASCII characters only, then each “code unit” is a character.
If the string contains a character whose codepoint is ≥ 2^16 (e.g. 😂), that character is 2 code units, and thus occupies more than 1 index. The result of string functions may be unexpected.
Here is an example showing the difference between character and code unit:
console.log("😂".length === 2); // true // 😂 // name: FACE WITH TEARS OF JOY // codepoint in decimal: 128514
Code Unit Explained
Here is a more detailed explanation.
- JavaScript strings and characters are based on the Unicode standard, version 5.1 or later.
- Each Unicode character has an integer ID, called “codepoint”.
- Unicode specifies several encoding standards; the most popular ones are UTF-8 and UTF-16.
- An encoding is a standard that translates a character into a sequence of bits.
- The UTF-16 encoding converts each character into 16 or 32 bits, depending on the character. (Each 16 bits is considered a unit, called a “code unit”.)
- For a character whose codepoint is less than 2^16, the UTF-16 encoding of that character is 16 bits. Otherwise, it is 32 bits.
- JavaScript defines an “element” of a string as one of the 16-bit values of the string's characters encoded in UTF-16. That is, first encode the characters in the string to bits by UTF-16; you get a sequence of bits. Then, group every 16 bits as a “code unit”. Index 0 is the first 16-bit unit, index 1 is the second 16-bit unit, etc. This means, when a string contains a character whose codepoint is ≥ 2^16, the result of any string method may be unexpected, because the index does not correspond to a character. (See the sketch below.)
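Here is a minimal sketch of code unit indexing, showing how one character can span two indexes. (Plain standard JavaScript; runs in any browser console or Node.)

// indexing goes by code unit, not character
const s = "😂X";
console.log(s.length); // 3, because 😂 takes 2 code units
console.log(s[0]); // lone high surrogate, prints as �
console.log(s[1]); // lone low surrogate, prints as �
console.log(s[2]); // X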
What is Codepoint?
Each Unicode character has an ID. It is an integer, starting at 0. This number is called the character's codepoint.
char | name | codepoint | codepoint in hex | UTF-8 | UTF-16 |
---|---|---|---|---|---|
a | LATIN SMALL LETTER A | 97 | 61 | 61 | 00 61 |
α | GREEK SMALL LETTER ALPHA | 945 | 3b1 | CE B1 | 03 B1 |
😂 | FACE WITH TEARS OF JOY | 128514 | 1f602 | F0 9F 98 82 | D8 3D DE 02 |
(It is not called “character id”, because some “characters” are not really characters, such as space, return, tab, the left-to-right marker, etc.)
- Unicode Basics: Character Set, Encoding, UTF-8, Codepoint
- Character to Codepoint → String.prototype.codePointAt
- Codepoint to Character → String.fromCodePoint
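As a quick check of the codepoint values in the table above, here is a sketch using these two functions:

console.log("a".codePointAt(0)); // 97
console.log("α".codePointAt(0)); // 945
console.log("😂".codePointAt(0)); // 128514

// and back, from codepoint to character
console.log(String.fromCodePoint(128514)); // 😂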
What Characters Create Problems?
Characters whose codepoint is ≥ 2^16 are emoji and other less frequently used characters, e.g. rarely used Chinese characters and characters of ancient scripts.
[see Unicode Emoji 😄]
Characters whose codepoint is less than 2^16 are those whose codepoint in hexadecimal can be expressed with 4 digits or fewer.
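Here is a sketch of a test for such characters. (The helper name hasAstralChar is made up for illustration; it is not part of any standard.)

// test whether a string contains any character with codepoint ≥ 2^16
const hasAstralChar = ((str) => [...str].some((c) => (c.codePointAt(0) >= 0x10000)));

console.log(hasAstralChar("abc") === false); // true
console.log(hasAstralChar("😂") === true); // true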
More Bad Examples
Here is an example of the “slice” method with an unexpected result.
// we want to take the second char, the capital X
console.log("😂X".slice(1)); // prints �X // wrong

// 😂
// name: FACE WITH TEARS OF JOY
// codepoint in decimal: 128514
[see String.prototype.slice]
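One workaround, as a sketch: spread the string into an array of characters first, then slice by character index.

// slice by character, not code unit
console.log([..."😂X"].slice(1).join("") === "X"); // true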
How to Go Thru Character (Not Code Unit)
If you have a string that may contain any character with codepoint ≥ 2^16, the result of any string method may not be what you think it is.
What you can do is use the (JS2015) for-of loop to go thru the string. The for-of loop goes thru the string by character, not by 16-bit code units.
[see for-of Loop]
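Here is a minimal example:

// for-of goes thru the string by character
for (const c of "a😂X") {
  console.log(c);
}
// prints
// a
// 😂
// X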
Real Length: Number of Characters in String
/* [
xah_string_real_length(str) return number of chars in string str.
http://xahlee.info/js/js_string_byte_sequence.html
version 2018-06-17
] */
const xah_string_real_length = ((str) => {
  let i = 0;
  for (let c of str) {
    i += 1;
  }
  return i;
});

// --------------------------------------------------
// test

console.log(xah_string_real_length("😂") === 1); // true
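A shorter alternative sketch with the same behavior: the spread operator also iterates by character, so the length of the spread array is the character count.

console.log([..."😂"].length === 1); // true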
[see Unicode Basics: Character Set, Encoding, UTF-8]
Convert String to Codepoint/CodeUnit
- Character (String) to Codepoint (Integer) → String.prototype.codePointAt
- Character (String) to Code Unit (As Integer) → String.prototype.charCodeAt
- Character (String) to Code Unit (As String of Length 1) → String.prototype.charAt
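Here is a sketch comparing the three on a character of 2 code units:

const s = "😂";
console.log(s.codePointAt(0)); // 128514, the full codepoint
console.log(s.charCodeAt(0)); // 55357 (hex d83d), just the first code unit
console.log(s.charAt(0)); // lone high surrogate, prints as �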
Convert Codepoint/CodeUnit to String
- Codepoint (Integer) to Character (String) → String.fromCodePoint
- Code Unit (Integer) to Character (String) → String.fromCharCode
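Here is a sketch going the other way:

console.log(String.fromCodePoint(128514)); // 😂
console.log(String.fromCharCode(0xd83d, 0xde02)); // 😂, built from its 2 code units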