JS: String Code Unit vs Code Point

By Xah Lee. Date: . Last updated: .

A JavaScript string is a sequence of 16-bit values (called “code units”) that represent characters in the UTF-16 encoding.

ECMAScript 2015 §ECMAScript Data Types and Values#sec-ecmascript-language-types-string-type

Each “element” in a string is technically not a “character”, but a “code unit”.

If the string contains a character whose code point is ≥ 2^16, that character takes 2 code units, and the result of string operations may not be what you expect.

Characters whose code point is ≥ 2^16 are either emoji, or other less used characters.

[see Unicode Emoji 😄]

Characters whose code point is < 2^16 are those whose code point in hexadecimal can be expressed with 4 digits or less.

Here's an example that shows the difference.
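To check whether a string contains any character with code point ≥ 2^16, one way is to walk the string and test each code point. This is a minimal sketch. The function name hasCharAbove16Bits is made up for illustration, not a built-in.

```javascript
// hypothetical helper: true if any character in str has code point >= 2^16
const hasCharAbove16Bits = (str) => {
    // for-of goes thru the string by code point, not by 16-bit unit
    for (const ch of str) {
        if (ch.codePointAt(0) >= 0x10000) {
            return true;
        }
    }
    return false;
};

console.log(hasCharAbove16Bits("abc α")); // false
console.log(hasCharAbove16Bits("ab😂")); // true
```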

// js strings are sequences of 16-bit values, not characters

console.log("😂".length === 2); // true
// wrong, if you expect the number of characters, which is 1

// 😂
// codepoint decimal: 128514
// codepoint hexadecimal: 1f602

Here's an example with the “slice” method.

// we want to take the first char

console.log ( "😂X".slice(0, 1));
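// prints �
// WRONG. The result is the lone code unit #xd83d, which is the first half of 😂, not a character
// � is REPLACEMENT CHARACTER (codepoint 65533, #o177775, #xfffd), shown in place of the lone code unit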

// 😂
// codepoint decimal: 128514
// codepoint hexadecimal: 1f602
// utf-8 encoding: #xF0 #x9F #x98 #x82
// utf-16 encoding (big endian, the first 2 bytes #xFE #xFF are the byte order mark): #xFE #xFF #xD8 #x3D #xDE #x02
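One way to take the first character correctly is to split the string by code point first, then slice the resulting array. This is a sketch only. The name sliceByCodePoint is made up for illustration, and it still splits characters made of several code points (such as a letter followed by a combining accent).

```javascript
// hypothetical helper: like slice, but start and end count code points,
// not 16-bit code units
const sliceByCodePoint = (str, start, end) =>
    // Array.from uses the string iterator, which goes by code point
    Array.from(str).slice(start, end).join("");

console.log(sliceByCodePoint("😂X", 0, 1)); // 😂
console.log(sliceByCodePoint("😂X", 1, 2)); // X
```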

What's a Codepoint?

Each Unicode character has an ID. It is an integer, starting at 0. This number is called the character's codepoint.

Unicode Code Point Example
char | codepoint | codepoint in hex
a | 97 | 61
α | 945 | 3b1
😂 | 128514 | 1f602

(It's not called “character id”, because some “characters” are not really characters, such as space, return, tab, left-to-right mark, etc.)

[see Unicode Basics: What's Character Set, Character Encoding, UTF-8?]
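The codepoints in the table above can be checked with the built-in codePointAt and String.fromCodePoint. The helper name codePointInfo below is made up for illustration.

```javascript
// hypothetical helper: return [codepoint, codepoint in hex] of the first char
const codePointInfo = (ch) => {
    const cp = ch.codePointAt(0);
    return [cp, cp.toString(16)];
};

for (const ch of ["a", "α", "😂"]) {
    const [cp, hex] = codePointInfo(ch);
    console.log(ch, cp, hex);
}
// a 97 61
// α 945 3b1
// 😂 128514 1f602

// the other direction: codepoint to character
console.log(String.fromCodePoint(128514) === "😂"); // true
```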

JavaScript String Encoding Explained

  1. JavaScript strings and characters are based on the Unicode standard, version 5.1 or later.
  2. Each Unicode character has an integer ID, called “code point”.
  3. Unicode specifies several encodings. The most popular ones are UTF-8 and UTF-16.
  4. An encoding is a standard that translates a character into a sequence of bits.
  5. JavaScript strings are encoded using UTF-16, by spec.
  6. UTF-16 encodes each character as either 2 bytes or 4 bytes. (1 byte is 8 bits.)
  7. For characters whose code point is less than 2^16, the encoding of that char in UTF-16 is 16 bits. (2 bytes)
  8. Characters that have code point ≥ 2^16 are rarely used except emoji. For example, rarely used Chinese characters, characters of ancient languages, or emoji, such as 😂 (codepoint 128514, #x1f602).
  9. When a string contains a character whose code point is ≥ 2^16, the UTF-16 encoding of that char is 4 bytes. That is, two 16-bit code units, called a “surrogate pair”. For example, the character 𐀀 (U+10000: LINEAR B SYLLABLE B008 A) has code point 65536, which is exactly 2^16. Its UTF-16 encoding is the 4 bytes xD8 x00 xDC x00 (big endian, without byte order mark), and "𐀀".length is 2.
  10. JavaScript defines an “element” of a string as a 16-bit value of the string encoding. That is, index 0 is the first 16-bit unit, index 1 is the second 16-bit unit, etc. This means, when a string contains a character whose code point is ≥ 2^16, the result of index-based string methods may be unexpected, because the index does not correspond to a character.
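The two 16-bit units for a character with code point ≥ 2^16 can be computed by hand. The following is a sketch of the standard UTF-16 surrogate pair arithmetic. The function name toUtf16CodeUnits is made up for illustration, and it assumes the input is a valid code point that is not itself in the surrogate range #xD800 to #xDFFF.

```javascript
// hypothetical helper: return the UTF-16 code units for a code point
const toUtf16CodeUnits = (codePoint) => {
    if (codePoint < 0x10000) {
        return [codePoint]; // fits in one 16-bit unit
    }
    const offset = codePoint - 0x10000; // a 20-bit number
    const high = 0xd800 + (offset >> 10); // top 10 bits
    const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
    return [high, low];
};

const toHex = (units) => units.map((u) => u.toString(16)).join(" ");

console.log(toHex(toUtf16CodeUnits(0x1f602))); // d83d de02
console.log(toHex(toUtf16CodeUnits(0x10000))); // d800 dc00

// compare with the code units JavaScript reports for 😂
console.log(toHex([ "😂".charCodeAt(0), "😂".charCodeAt(1) ])); // d83d de02
```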

If you have a string that may contain any character with code point ≥ 2^16, index-based properties and methods (such as length, slice, charAt) count 16-bit units, so the result may not be what you think it is.

What you can do is to use the (ES2015) for-of loop to go thru the string. The for-of loop goes thru the string by code point, not by 16-bit units.

[see JS: for-of Loop]
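Here is a small comparison of an index loop and a for-of loop on the same string. The function names are made up for illustration.

```javascript
// hypothetical helper: split by index, that is, by 16-bit code unit
const splitByCodeUnit = (str) => {
    const result = [];
    for (let i = 0; i < str.length; i++) {
        result.push(str[i]);
    }
    return result;
};

// hypothetical helper: split with for-of, that is, by code point
const splitByCodePoint = (str) => {
    const result = [];
    for (const ch of str) {
        result.push(ch);
    }
    return result;
};

console.log(splitByCodeUnit("a😂b").length); // 4
console.log(splitByCodePoint("a😂b").length); // 3
console.log(splitByCodePoint("a😂b")[1] === "😂"); // true
```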

Real Length: Number of Characters in String

// function that returns number of chars (code points) in string
const stringRealLength = (str) => {
    let i = 0;
    for (const c of str) {
        i += 1;
    }
    return i;
};

// test
console.log ( stringRealLength("😸") ); // 1

// GRINNING CAT FACE WITH SMILING EYES, 128568, U+1F638
console.log("😸".length); // 2

ECMAScript 2015 §ECMAScript Language: Source Code#sec-ecmascript-language-source-code


Character Topic

  1. JS: String Code Unit vs Code Point
  2. JS: Convert Character To/From Codepoint
  3. JS: String.fromCodePoint
  4. JS: String.fromCharCode
  5. JS: String.prototype.charAt
  6. JS: String.prototype.charCodeAt
  7. JS: String.prototype.codePointAt
  8. JS: Convert Decimal/Hexadecimal
  9. JS: Unicode Escape Sequence

String Topic

  1. JS: String Overview
  2. JS: Template String
  3. JS: String Object
  4. JS: String.prototype
  5. JS: String Code Unit vs Code Point
  6. JS: String Escape Sequence
  7. JS: Unicode Escape Sequence
  8. JS: Source Code Encoding
  9. JS: Allowed Characters in Identifier
  10. JS: Convert String to Number
  11. JS: Encode URL, Escape String
  12. JS: Format Number
  13. JS: JSON