JS: String Code Unit

By Xah Lee.

What is Code Unit?

A code unit is a 2-byte (16-bit) unit of a Unicode character's UTF-16 encoding. One character is encoded as 1 or 2 code units.

Example: Difference of Character and Code Unit

console.log("🦋".length);
// 2
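To see where the 2 comes from, one can print the individual code units. This is a small sketch using the standard methods String.prototype.charCodeAt and String.prototype.codePointAt; the variable name butterfly is just for illustration.

```javascript
const butterfly = "🦋";

// each index refers to one code unit
console.log(butterfly.charCodeAt(0).toString(16));
// d83e
console.log(butterfly.charCodeAt(1).toString(16));
// dd8b

// codePointAt reads the whole character that starts at index 0
console.log(butterfly.codePointAt(0).toString(16));
// 1f98b
```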

Here is an example of the String.prototype.slice method giving an unexpected result.

// we want to take the substring abc
console.log("🦋abc".slice(1));
// �abc
// incorrect result
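One possible way to get the intended substring is to split the string into characters first, then slice the array. This is a sketch, not the only approach. It relies on Array.from, which goes thru a string by character and not by code unit. The variable names are for illustration.

```javascript
const text = "🦋abc";

// Array.from splits the string by character (code point)
const chars = Array.from(text);

// slice the array of characters, then join back into a string
const rest = chars.slice(1).join("");
console.log(rest);
// abc
```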

Code Unit Explained

Here is a more detailed explanation.

  1. JavaScript strings and characters are based on the Unicode standard, version 5.1 or later.
  2. In Unicode, each character has an integer ID, called its Code Point.
  3. Unicode specifies several encoding standards. The most popular ones are UTF-8 and UTF-16.
  4. Encoding means a standard that translates a character into a sequence of bytes. 〔see Unicode: Character Set, Encoding, UTF-8, Code Point〕
  5. UTF-16 encoding converts each character into 2 or 4 bytes, depending on the character. (Each 2 bytes is considered a unit, called a code unit.)
  6. For characters whose codepoint is less than 2^16, the encoding of that character in UTF-16 is 2 bytes. Otherwise, it's 4 bytes.
  7. JavaScript defines a string as a sequence of 2-byte values, which are the code units of its characters encoded in UTF-16. That is, first encode each character in the string by UTF-16, and you get 2 or 4 bytes per character. Then, group every 2 bytes as a code unit. Then, index 0 is the first code unit, index 1 is the second code unit, etc. This means that when a string contains a character whose codepoint is ≥ 2^16, the result of a string method that works by index (such as length or slice) may be unexpected, because the index does not correspond to a character.
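The steps above can be sketched in code. The function toCodeUnits below is a hypothetical helper written for this illustration, not part of JavaScript. It applies the UTF-16 rule: a codepoint less than 2^16 becomes 1 code unit, otherwise it becomes 2 code units (a surrogate pair). It assumes a valid codepoint and does no error checking.

```javascript
// hypothetical helper: return array of UTF-16 code units for a codepoint
function toCodeUnits(codePoint) {
  if (codePoint < 0x10000) {
    // fits in one 2-byte unit
    return [codePoint];
  }
  // otherwise split the 20 bits of (codePoint - 2^16) into two halves
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

// helper for printing in hexadecimal
const toHex = (units) => units.map((x) => x.toString(16)).join(" ");

console.log(toHex(toCodeUnits(0x61)));
// 61
console.log(toHex(toCodeUnits(0x1f98b)));
// d83e dd8b

// same values as what JavaScript stores in the string
console.log("🦋".charCodeAt(0) === toCodeUnits(0x1f98b)[0]);
// true
```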

What Characters Create Problems?

Characters whose codepoint is ≥ 2^16 are emoji and other less frequently used characters, for example rarely used Chinese characters and characters of ancient languages.

〔see Emoji: Faces 😄〕

Characters whose codepoint is less than 2^16 are those whose codepoint in hexadecimal can be expressed with 4 digits or fewer.
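A quick way to check whether a string contains any such character is a regular expression with the u flag. This is a sketch; the function name hasBigCodepoint is made up for this example.

```javascript
// hypothetical helper: true if str has any character with codepoint ≥ 2^16
function hasBigCodepoint(str) {
  // the u flag makes the regex match by character, not by code unit
  return /[\u{10000}-\u{10FFFF}]/u.test(str);
}

console.log(hasBigCodepoint("abc"));
// false
console.log(hasBigCodepoint("abc😄"));
// true

// codepoint in hexadecimal: 4 digits or fewer vs 5 digits or more
console.log("a".codePointAt(0).toString(16));
// 61
console.log("😄".codePointAt(0).toString(16));
// 1f604
```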

How to Go Thru Character (Not Code Unit)

If you have a string that may contain any character with codepoint ≥ 2^16, the result of any string method may be unexpected.

The solution is to use a for-of loop to go thru the string. It goes thru the string by character, not by code unit.
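Here is a minimal example. The variable names are for illustration; the array seen is only there to collect what the loop visits.

```javascript
const str = "🦋abc";
const seen = [];

// for-of on a string gives one whole character per step
for (const ch of str) {
  seen.push(ch);
  console.log(ch);
}
// 🦋
// a
// b
// c
```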

Real Length: Number of Characters in String
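One way to count characters instead of code units is to convert the string to an array of characters, then take the array's length. This is a sketch, using Array.from or the spread operator. Note that it counts codepoints, so a symbol made of several codepoints (for example, a letter followed by a combining accent) still counts as more than 1.

```javascript
const s = "🦋abc";

// number of code units
console.log(s.length);
// 5

// number of characters (codepoints)
console.log(Array.from(s).length);
// 4

// same, using spread operator
console.log([...s].length);
// 4
```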
