JS: Character, Code Unit, Code Point

By Xah Lee.

When we see a string such as "xyz", we think of each of x, y, z as a character.

// JS strings are sequences of 16-bit values, not characters

console.log("😂".length === 2); // true
// the length is 2, not 1 as you might expect

// 😂
// name: FACE WITH TEARS OF JOY
// codepoint in decimal: 128514

Technically, JavaScript does not have a character type.

A JavaScript string is a sequence of 16-bit values (called “code units”) that represent characters in the UTF-16 encoding.

ECMAScript 2015 §ECMAScript Data Types and Values#sec-ecmascript-language-types-string-type

Each “element” in a string is technically not a “character”, but a “code unit”.

The significance is this: if a string contains a character whose Unicode codepoint (the ID of the character) is ≥ 2^16, that character takes two code units, and the result of most string methods may not be what you expect.
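As a small illustration of the point above, the following sketch inspects the code units of one such character using standard string methods. The variable name `face` is only for this example.

```javascript
// "😂" is one character (codepoint 128514) stored as two 16-bit code units.
const face = "😂";

// length counts code units, not characters
console.log(face.length); // 2

// charCodeAt returns the single code unit at an index
console.log(face.charCodeAt(0).toString(16)); // d83d
console.log(face.charCodeAt(1).toString(16)); // de02

// codePointAt reads the whole character that starts at that index
console.log(face.codePointAt(0)); // 128514

// charAt returns only half of the character
console.log(face.charAt(0) === face); // false
```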

What's Codepoint?

Each Unicode character has an ID. It is an integer, starting at 0. This number is called the character's codepoint.

Unicode Codepoint Example
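| char | name | codepoint | codepoint in hex | UTF-8 | UTF-16 |
|------|------|-----------|------------------|-------|--------|
| a | LATIN SMALL LETTER A | 97 | 61 | 61 | 00 61 |
| α | GREEK SMALL LETTER ALPHA | 945 | 3b1 | CE B1 | 03 B1 |
| 😂 | FACE WITH TEARS OF JOY | 128514 | 1f602 | F0 9F 98 82 | D8 3D DE 02 |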

(It's not called “character ID”, because some “characters” are not really characters, such as space, return, tab, left-to-right mark, etc.)
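To see codepoints from inside JavaScript, here is a minimal sketch using the standard `codePointAt` and `String.fromCodePoint`. The names `sampleChars` and `rows` are made up for this example.

```javascript
// Look up the codepoint of a few characters, in decimal and hexadecimal.
const sampleChars = ["a", "α", "😂"];

const rows = sampleChars.map((ch) => {
  const cp = ch.codePointAt(0);
  return { char: ch, decimal: cp, hex: cp.toString(16) };
});

console.log(rows);

// Going the other way: from codepoint back to character.
console.log(String.fromCodePoint(128514) === "😂"); // true
```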

[see Unicode Basics: What's Character Set, Character Encoding, UTF-8?]

[see JS: Convert Character To/From Codepoint]

What's JavaScript String Code Unit?

  1. JavaScript strings and characters are based on the Unicode standard, version 5.1 or later.
  2. Each Unicode character has an integer ID, called “codepoint”.
  3. Unicode specifies several encoding standards. The most popular ones are UTF-8 and UTF-16.
  4. An encoding is a standard that translates a character into a sequence of bits.
  5. The UTF-16 encoding converts each character into 16 or 32 bits, depending on the character. (Each 16 bits is considered a unit.)
  6. For characters whose codepoint is less than 2^16, the UTF-16 encoding of that character is 16 bits. For characters whose codepoint is ≥ 2^16, it is 32 bits, that is, two units.
  7. JavaScript defines the “elements” of a string as the sequence of 16-bit values of its characters encoded in UTF-16. That is, first encode the characters in the string to bits by UTF-16, and you get a sequence of bits. Then group every 16 bits as a “code unit”. Index 0 is the first 16-bit unit, index 1 is the second 16-bit unit, etc. This means that when a string contains a character whose codepoint is ≥ 2^16, the result of any string method may be unexpected, because the index does not correspond to a character.
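The steps above can be sketched in code. The function below is a hypothetical helper (`toUtf16CodeUnits` is not a built-in) that turns one codepoint into its UTF-16 code units. It assumes the input is a valid codepoint and does not check for the reserved range d800 to dfff.

```javascript
// Hypothetical helper: encode one codepoint as an array of 16-bit code units.
function toUtf16CodeUnits(codePoint) {
  // Codepoints below 2^16 fit in a single code unit.
  if (codePoint < 0x10000) {
    return [codePoint];
  }
  // Otherwise, split the remaining 20 bits into two 10-bit halves.
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10); // first unit
  const low = 0xdc00 + (offset & 0x3ff); // second unit
  return [high, low];
}

const units = toUtf16CodeUnits(128514); // 😂
console.log(units.map((u) => u.toString(16))); // [ 'd83d', 'de02' ]

// The result matches what the string itself stores at index 0 and 1.
console.log(units[0] === "😂".charCodeAt(0)); // true
console.log(units[1] === "😂".charCodeAt(1)); // true
```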

What Characters Create Problems?

Characters whose codepoint is ≥ 2^16 include most emoji and other less frequently used characters, e.g. rarely used Chinese characters and characters of ancient languages.

[see Unicode Emoji 😄]

Characters whose codepoint is less than 2^16 are those whose codepoint in hexadecimal can be expressed with 4 digits or fewer.

2^16 is 65536.
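If you want to know whether a given string contains any such character, one possible check is a regular expression with the `u` flag. This is a sketch; the function name `hasCodepointAbove16Bits` is invented for this example.

```javascript
// Hypothetical helper: true if the string has any character with codepoint ≥ 2^16.
// The "u" flag makes the regex match by codepoint instead of by code unit.
const hasCodepointAbove16Bits = (str) => /[\u{10000}-\u{10FFFF}]/u.test(str);

console.log(hasCodepointAbove16Bits("xyz")); // false
console.log(hasCodepointAbove16Bits("α")); // false
console.log(hasCodepointAbove16Bits("😂X")); // true
```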

More Bad Examples

Here's an example of the “slice” method giving an unexpected result.

// we want to take the second char, the capital X

console.log ( "😂X".slice(1)); // prints �X
// wrong. slice(1) starts at the second code unit of 😂, not at X

// 😂
// name: FACE WITH TEARS OF JOY
// codepoint in decimal: 128514

[see JS: String.prototype.slice]
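One way to get the expected result is to split the string into an array of characters first, then slice the array. This is a sketch using the standard `Array.from`, which goes thru a string by codepoint. The variable names are made up for this example.

```javascript
const s = "😂X";

// Slicing the string directly works on code units.
console.log(s.slice(1) === "X"); // false

// Array.from splits the string by codepoint, one array item per character.
const byCodePoint = Array.from(s);
console.log(byCodePoint.length); // 2

// Slice the array, then join back into a string.
const secondChar = byCodePoint.slice(1).join("");
console.log(secondChar === "X"); // true
```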

Solution

If you have a string that may contain any character with codepoint ≥ 2^16, the result of any string method may not be what you think it is.

What you can do is use the (ES2015) for-of loop to go thru the string. The for-of loop goes thru a string by character (codepoint), not by 16-bit unit.

[see JS: for-of Loop]
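Here is a short sketch comparing an index loop with a for-of loop on the same string. The variable names are only for this example.

```javascript
const text = "😂X";

// Index loop: one step per 16-bit code unit.
const byIndex = [];
for (let i = 0; i < text.length; i++) {
  byIndex.push(text[i]);
}
console.log(byIndex.length); // 3

// for-of loop: one step per character.
const byForOf = [];
for (const ch of text) {
  byForOf.push(ch);
}
console.log(byForOf.length); // 2
console.log(byForOf[0] === "😂"); // true
```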

Real Length: Number of Characters in String

const xah_string_real_length = (str => {
/* [
 returns number of chars in string.
 http://xahlee.info/js/js_string_byte_sequence.html
 version 2018-06-17
 ] */
    let i = 0;
    for (let c of str) {
        i += 1;
    }
    return i;
 });

// --------------------------------------------------
// test
console.log ( xah_string_real_length("😂") === 1 ); // true

ECMAScript 2015 §ECMAScript Language: Source Code#sec-ecmascript-language-source-code

