JS: String Code Unit

By Xah Lee.

What is a Code Unit?

A JavaScript string is a sequence of 16-bit values (called code units) that represent characters in the UTF-16 encoding.

Each “element” in a string is technically not a “character”, but a “code unit”.

If the string contains only ASCII characters, then each “code unit” corresponds to one character.

If the string contains a character whose codepoint is ≥ 2^16 (e.g. 😂), that character takes 2 code units, and thus occupies 2 indexes. The result of string functions may be unexpected.

Example: Difference of Character and Code Unit

console.log("😂".length === 2); // true

// 😂
// name: FACE WITH TEARS OF JOY
// codepoint in decimal: 128514

Here is an example where the String.prototype.slice method gives an unexpected result.

// we want to take the substring abc

console.log(("😂abc".slice(1) === "abc") === false); // true
// the result is not "abc": slice(1) removes only the first code unit of 😂, the second one stays

// 😂
// name: FACE WITH TEARS OF JOY
// codepoint in decimal: 128514

Code Unit Explained

Here is a more detailed explanation.

  1. JavaScript strings and characters are based on the Unicode standard, version 5.1 or later.
  2. In Unicode, each character has an integer ID, called its codepoint.
  3. Unicode specifies several encoding standards. The most popular ones are UTF-8 and UTF-16.
  4. An encoding is a standard that translates a character into a sequence of bytes. 〔see Unicode: Character Set, Encoding, UTF-8, Codepoint〕
  5. UTF-16 encoding converts each character into 2 or 4 bytes, depending on the character. (Each 2 bytes is considered a unit, called a code unit.)
  6. For a character whose codepoint is less than 2^16, its encoding in UTF-16 is 2 bytes. Otherwise, it is 4 bytes.
  7. JavaScript defines the elements of a string as the sequence of 2-byte values of its characters encoded in UTF-16. That is, first encode each character in the string by UTF-16, which gives 2 or 4 bytes per character. Then, group every 2 bytes as a code unit. Then, index 0 is the first 2-byte unit, index 1 is the second 2-byte unit, etc. This means that when a string contains a character whose codepoint is ≥ 2^16, the result of a string method may be unexpected, because an index does not correspond to a character.
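The steps above can be sketched in code. This is a minimal sketch, not how any JavaScript engine does it. The function name toSurrogatePair is made up for illustration. The arithmetic is the standard UTF-16 rule for codepoints ≥ 2^16: subtract 2^16, then split the remaining 20 bits into two 10-bit halves.

```javascript
// Hypothetical helper (not a built-in): encode a codepoint >= 2^16
// as its two UTF-16 code units.
const toSurrogatePair = (codepoint) => {
  const offset = codepoint - 0x10000; // a 20-bit value
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
};

const face = "😂"; // codepoint 128514, hexadecimal 1F602

// codePointAt reads the whole character
console.log(face.codePointAt(0)); // 128514

// charCodeAt reads one code unit at the given index
console.log(face.charCodeAt(0).toString(16)); // d83d
console.log(face.charCodeAt(1).toString(16)); // de02

// the helper computes the same two code units
console.log(
  toSurrogatePair(128514)
    .map((x) => x.toString(16))
    .join(" "),
); // d83d de02
```

So index 0 and index 1 of "😂" each hold half of the character, and neither half is a character by itself.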

What Characters Create Problems?

Characters whose codepoint is ≥ 2^16 include most emoji and other less frequently used characters, e.g. rarely used Chinese characters and ancient language characters.

〔see Unicode: Emoji 😄〕

Characters whose codepoint is less than 2^16 are those whose codepoint in hexadecimal can be expressed with 4 digits or fewer.
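Here is one way to check whether a string contains any such character. A minimal sketch: the function name hasBigCodepoint is made up for illustration, and it relies on the regex u flag, which makes the regex match by codepoint instead of by code unit.

```javascript
// Hypothetical helper: true if any character in str has codepoint >= 2^16,
// that is, takes 2 code units.
const hasBigCodepoint = (str) => /[\u{10000}-\u{10FFFF}]/u.test(str);

console.log(hasBigCodepoint("abc")); // false
console.log(hasBigCodepoint("α→中")); // false
console.log(hasBigCodepoint("😂abc")); // true

// the codepoint of 😂 in hexadecimal needs 5 digits
console.log((128514).toString(16)); // 1f602
```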

How to Go Thru Character (Not Code Unit)

If you have a string that may contain a character with codepoint ≥ 2^16, the result of a string method may be unexpected.

The solution is to use a for-of loop to go thru the string.
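For example, here is the earlier slice problem done with for-of. A minimal sketch: the variable names are just for illustration.

```javascript
const str = "😂abc";

// for-of gives one whole character per step,
// even when the character takes 2 code units
const chars = [];
for (const ch of str) {
  chars.push(ch);
}
console.log(chars.join("|")); // 😂|a|b|c

// drop the first character, not the first code unit
const rest = chars.slice(1).join("");
console.log(rest === "abc"); // true
```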

Real Length: Number of Characters in String
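One way to count characters instead of code units is to spread the string into an array, since spread goes thru the string the same way for-of does. A minimal sketch: the function name charCount is made up for illustration.

```javascript
// Hypothetical helper: number of characters (codepoints), not code units.
const charCount = (str) => [...str].length;

const text = "😂abc";

console.log(text.length); // 5
console.log(charCount(text)); // 4
```

This counts codepoints. A symbol built from several codepoints, such as a flag emoji, still counts as more than 1.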
