JavaScript: String Code Unit

By Xah Lee. Date: . Last updated: .

JavaScript does not have a character datatype. A string is used as a sequence of characters. However, a JavaScript string is technically a sequence of 16-bit units, not characters. This page explains the details of working with a string as a sequence of characters.

What is Code Unit?

A JavaScript string is a sequence of 16-bit values (called “code units”) that represent characters in the UTF-16 Encoding.

Each “element” in a string is technically not a “character”, but a “code unit”.

If the string contains only ASCII characters, then each “code unit” is one character.

If the string contains a character whose Codepoint is ≥ 2^16 (e.g. 😂), that character is 2 code units, thus occupies more than 1 index. Results of string functions may be unexpected.

Here is an example showing the difference between character and code unit:

console.log("😂".length === 2); // true

// 😂
// name: FACE WITH TEARS OF JOY
// codepoint in decimal: 128514

Here is an example where the String.prototype.slice method gives an unexpected result.

// we want to take the substring abc

console.log(("😂abc".slice(1) === "abc") === false); // true
// wrong

// 😂
// name: FACE WITH TEARS OF JOY
// codepoint in decimal: 128514
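A possible workaround, sketched here as an assumption on my part (the fix is not prescribed above): convert the string to an array of characters first. Array.from splits a string by codepoint, not by code unit.

```javascript
// Array.from uses the string iterator, which walks codepoints,
// so the astronaut-tears emoji counts as one element.
const chars = Array.from("😂abc"); // ["😂", "a", "b", "c"]

// slicing the array, then joining, gives a character-based slice
console.log(chars.slice(1).join("") === "abc"); // true
```

The spread syntax [..."😂abc"] gives the same array, since it also uses the string iterator.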

Code Unit Explained

Here is a more detailed explanation.

  1. JavaScript strings and characters are based on the Unicode standard, version 5.1 or later.
  2. Each character has an integer ID, called Codepoint.
  3. Unicode specifies several encoding standards; the most popular ones are UTF-8 and UTF-16.
  4. Encoding means a standard that translates a character into a sequence of Bytes. [see Unicode Basics: Character Set, Encoding, UTF-8, Codepoint]
  5. UTF-16 encoding converts each character into 2 or 4 bytes, depending on the character. (each 2 bytes is considered a unit, called “code unit”.)
  6. For characters whose codepoint is less than 2^16, the encoding of that char in UTF-16 is 2 bytes. Otherwise, it's 4 bytes.
  7. JavaScript defines an “element” of a string as one of the 2-byte values of the characters encoded in UTF-16. That is, first encode the characters in the string to bits by UTF-16; you get 2 or 4 bytes per character. Then group every 2 bytes as a “code unit”. Index 0 is the first 2-byte unit, index 1 is the second 2-byte unit, etc. This means that when a string contains a character whose codepoint is ≥ 2^16, the result of any string method may be unexpected, because the index does not correspond to the character.
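The steps above can be seen directly with the standard methods charCodeAt (returns a code unit) and codePointAt (returns a full codepoint):

```javascript
const s = "😂a";

// length counts code units, not characters
console.log(s.length); // 3

// charCodeAt(0) returns the first code unit,
// which is the high surrogate of 😂, not a character
console.log(s.charCodeAt(0)); // 55357 (hexadecimal d83d)

// codePointAt(0) returns the full codepoint of 😂
console.log(s.codePointAt(0)); // 128514 (hexadecimal 1f602)
```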

What Characters Create Problems?

Characters whose codepoint is ≥ 2^16 are emoji and other less frequently used characters, e.g., rarely used Chinese characters and characters of ancient scripts.

[see Unicode Emoji 😄]

Characters whose codepoint is less than 2^16 are those whose codepoint in hexadecimal can be expressed with 4 digits or less.
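You can check this with codePointAt and toString (a small sketch of the 4-hex-digit rule stated above):

```javascript
// the largest codepoint that fits in one code unit
console.log((2 ** 16 - 1).toString(16)); // "ffff", 4 hex digits

// "a" has a 2-digit hex codepoint → 1 code unit
console.log("a".codePointAt(0).toString(16)); // "61"

// 😂 has a 5-digit hex codepoint → 2 code units
console.log("😂".codePointAt(0).toString(16)); // "1f602"
```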

How to Go Thru Character (Not Code Unit)

If you have a string that may contain any character with codepoint ≥ 2^16, the result of any string method may be unexpected.

The solution is to use the for-of Loop to go thru the string, because for-of iterates by character (codepoint), not by code unit.

[see Real Length: Number of Characters in String]
