JS: String Code Unit vs Code Point

By Xah Lee.

A JavaScript string is a sequence of 16-bit values (called “code units”) that represent characters in the UTF-16 encoding.

ECMAScript 2015 §ECMAScript Data Types and Values#sec-ecmascript-language-types-string-type

An element of a string is a code unit, not necessarily a whole character.

If the string contains a character whose code point is ≥ 2^16, that character takes 2 code units, so properties and methods that work by index (such as length and slice) may give results you do not expect.

Characters whose code point is ≥ 2^16 are emoji or other less-used characters.

Characters whose code point is < 2^16 are those whose code point in hexadecimal can be expressed with 4 digits or fewer.

[see Unicode Emoji 😄 😱 👽]

Here's an example that shows the difference.

// a JS string is a sequence of 16-bit values, not characters

console.log("😂".length === 2); // true
// the string has 1 character, but its length is 2

// 😂
// codepoint decimal: 128514
// codepoint hexadecimal: 1f602

Here is an example with the “slice” method.

// we want to take the first char

console.log ( "😂X".slice(0, 1));
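// prints �
// WRONG. The result is only the first half of 😂, the lone code unit #xd83d.
// It is displayed as the replacement character � (codepoint 65533, #o177775, #xfffd)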

// 😂
// codepoint decimal: 128514
// codepoint hexadecimal: 1f602
// utf-8 encoding: #xF0 #x9F #x98 #x82
// utf-16 encoding (big endian, #xFE #xFF is the byte order mark): #xFE #xFF #xD8 #x3D #xDE #x02
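
One way to take the first character without cutting a surrogate pair in half is to read the first code point and convert it back to a string. The following is a minimal sketch. The name firstChar is hypothetical, not something from this article or from the language. codePointAt and String.fromCodePoint are standard ES2015 methods.

```javascript
// firstChar is a hypothetical helper name.
// It returns the first code point of a string as a string,
// or an empty string if the input is empty.
const firstChar = (str) =>
  str.length === 0 ? "" : String.fromCodePoint(str.codePointAt(0));

console.log(firstChar("😂X")); // 😂
console.log(firstChar("😂X").length); // 2
console.log(firstChar("aX")); // a
```

The result for "😂X" still has length 2, because the emoji itself takes 2 code units.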

What's a Codepoint?

Each Unicode character has an ID. It is an integer, starting at 0. This number is called the character's codepoint.

Unicode Code Point Example

char  codepoint  codepoint in hex
a     97         61
α     945        3b1
😂    128514     1f602

(It's not called “character id”, because some “characters” are not really characters, such as space, return, tab, left-to-right mark, etc.)
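
The values in the table can be checked in JavaScript. This is a small sketch. describeCodePoint is a hypothetical helper name. It uses the standard codePointAt method and Number.prototype.toString with radix 16.

```javascript
// describeCodePoint is a hypothetical helper name.
// It reports the first code point of a string in decimal and in hexadecimal.
const describeCodePoint = (ch) => {
  const cp = ch.codePointAt(0);
  return { char: ch, decimal: cp, hex: cp.toString(16) };
};

for (const ch of ["a", "α", "😂"]) {
  const d = describeCodePoint(ch);
  console.log(d.char, d.decimal, d.hex);
}
// a 97 61
// α 945 3b1
// 😂 128514 1f602
```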

[see Unicode Basics: What's Character Set, Character Encoding, UTF-8?]

JavaScript String Encoding Explained

  1. JavaScript string and character are based on Unicode standard, version 5.1 or later.
  2. Each Unicode character has an integer ID, called “code point”.
  3. Unicode specifies several encoding standards. The most popular ones are UTF-8 and UTF-16.
  4. An encoding is a standard that translates a character into a sequence of bits.
  5. JavaScript strings are encoded using UTF-16, by spec.
  6. UTF-16 encodes each character as 2 bytes or 4 bytes. (At least 2 bytes. 1 byte is 8 bits.)
  7. For characters whose code point is less than 2^16, the encoding of that char in UTF-16 is 16 bits. (2 bytes)
  8. Characters with code point ≥ 2^16 are rarely used, except for emoji. Examples are rarely used Chinese characters, characters of ancient languages, and newer emoji such as 😂 (codepoint 128514, #x1f602).
  9. When a string contains a character whose code point is ≥ 2^16, that character's UTF-16 encoding is 4 bytes: two 16-bit code units, called a surrogate pair. For example, the character 𐀀 (U+10000: LINEAR B SYLLABLE B008 A) has code point 65536, which is exactly 2^16. Its encoding in UTF-16 is the 4 bytes xD8 x00 xDC x00 (after the byte order mark xFE xFF, if one is present), and "𐀀".length is 2.
  10. JavaScript defines an “element” of a string as a 16-bit value of the string's encoding. That is, index 0 is the first 16-bit unit, index 1 is the second 16-bit unit, etc. This means that when a string contains a character whose code point is ≥ 2^16, the result of any string method that works by index may be unexpected, because an index does not correspond to a character.
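
To make the steps above concrete, here is a sketch of how UTF-16 turns a code point ≥ 2^16 into two 16-bit code units. toSurrogatePair is a hypothetical name used only for illustration. The arithmetic follows the UTF-16 definition: subtract #x10000, then split the remaining 20 bits into two groups of 10 bits.

```javascript
// toSurrogatePair is a hypothetical helper name.
// It computes the two UTF-16 code units for a code point >= 0x10000.
const toSurrogatePair = (codePoint) => {
  const offset = codePoint - 0x10000; // a 20-bit value
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
};

const [high, low] = toSurrogatePair(0x1f602); // 😂
console.log(high.toString(16), low.toString(16)); // d83d de02

// the same values are what the string elements hold
console.log("😂".charCodeAt(0) === high); // true
console.log("😂".charCodeAt(1) === low); // true
```

The two units d83d and de02 match the bytes D8 3D DE 02 in the utf-16 encoding of 😂.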

If a string may contain a character with code point ≥ 2^16, properties and methods that work by index, such as length, slice, and charAt, may not give the result you expect.

What you can do is use the (ES2015) for-of loop to go through the string. The for-of loop goes through a string by code point, not by 16-bit unit.

[see JS: for-of Loop]
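
Here is a short sketch that compares the two ways of going through a string. The variable names are for illustration only.

```javascript
const s = "a😂b";

// index loop: visits each 16-bit code unit
const units = [];
for (let i = 0; i < s.length; i++) {
  units.push(s.charCodeAt(i).toString(16));
}
console.log(units.join(" ")); // 61 d83d de02 62

// for-of loop: visits each code point
const chars = [];
for (const c of s) {
  chars.push(c);
}
console.log(chars.join(" ")); // a 😂 b
```

The index loop sees 4 elements, the for-of loop sees 3.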

Real Length: Number of Characters in String

// function that returns the number of chars (code points) in a string
const stringRealLength = (str) => {
    let count = 0;
    for (const _ of str) {
        count += 1;
    }
    return count;
};

// test
console.log ( stringRealLength("😸") ); // 1

// GRINNING CAT FACE WITH SMILING EYES, 128568, U+1F638
console.log("😸".length); // 2
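
The same idea works for taking a substring. The following is a sketch, not a standard method. sliceByCodePoint is a hypothetical name. It relies on Array.from, which splits a string by code point in the same way the for-of loop does. It counts code points, so a character made of several code points (for example, some emoji with modifiers) still counts as more than 1.

```javascript
// sliceByCodePoint is a hypothetical helper name.
// It works like String.prototype.slice, but start and end count code points,
// not 16-bit code units.
const sliceByCodePoint = (str, start, end) =>
  Array.from(str).slice(start, end).join("");

console.log(sliceByCodePoint("😂X", 0, 1)); // 😂
console.log(sliceByCodePoint("😂X", 1)); // X

// Array.from also gives the number of code points directly
console.log(Array.from("😸").length); // 1
```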

ECMAScript 2015 §ECMAScript Language: Source Code#sec-ecmascript-language-source-code

