JS: Character, Code Unit, Codepoint

By Xah Lee.

What's Code Unit?

A JavaScript string is a sequence of 16-bit values (called “code units”) that represent characters in the UTF-16 encoding.

Each “element” in a string is technically not a “character”, but a “code unit”.

If the string contains only ASCII characters, then each code unit is a character.

If the string contains a non-ASCII character, that character may be represented by 2 code units, and the result of string operations and methods may not be what you expect.

Here's an example showing the difference between character and code unit:

// js strings are sequences of 16-bit values, not characters

console.log("😂".length === 2); // true

// 😂
// name: FACE WITH TEARS OF JOY
// codepoint in decimal: 128514

String Code Unit Explained

Here's a more detailed explanation.

  1. JavaScript strings and characters are based on the Unicode standard, version 5.1 or later.
  2. Each Unicode character has an integer ID, called its “codepoint”.
  3. Unicode specifies several encodings. The most popular are UTF-8 and UTF-16.
  4. An encoding is a standard that translates a character into a sequence of bits.
  5. UTF-16 encodes each character as 16 or 32 bits, depending on the character. (Each 16 bits is considered a unit, called a “code unit”.)
  6. For a character whose codepoint is less than 2^16, the UTF-16 encoding is 16 bits (1 code unit). Otherwise, it's 32 bits (2 code units).
  7. JavaScript defines the “elements” of a string as the 16-bit values of its characters encoded in UTF-16. That is, first encode the characters of the string by UTF-16 to get a sequence of bits. Then group every 16 bits as a “code unit”. Index 0 is the first 16-bit unit, index 1 is the second 16-bit unit, etc. This means that when a string contains a character whose codepoint is ≥ 2^16, the result of a string method that works by index (such as length or slice) may be unexpected, because the index does not correspond to a character.
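
The steps above can be sketched in code. The helper name toUtf16CodeUnits is made up for this illustration (it is not a built-in), and it skips validation such as rejecting codepoints in the surrogate range. It is a minimal sketch of how one codepoint becomes 1 or 2 code units:

```javascript
// Hypothetical helper (not a built-in): encode one codepoint
// into its UTF-16 code units.
const toUtf16CodeUnits = (codepoint) => {
    if (codepoint < 0x10000) {
        // less than 2^16: fits in one 16-bit code unit
        return [codepoint];
    }
    // otherwise split into 2 code units (a "surrogate pair", 32 bits)
    const offset = codepoint - 0x10000;
    const high = 0xD800 + (offset >> 10);   // top 10 bits
    const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
    return [high, low];
};

const units = toUtf16CodeUnits(128514); // 😂
console.log(units.map((u) => u.toString(16))); // [ 'd83d', 'de02' ]

// the string indexes expose exactly these values
console.log("😂".charCodeAt(0) === units[0]); // true
console.log("😂".charCodeAt(1) === units[1]); // true
```

This is why "😂".length is 2: the string holds two 16-bit elements for one character.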

What's Codepoint?

Each Unicode character has an ID. It is an integer, starting at 0. This number is called the character's codepoint.

Unicode Codepoint Example
char  name                      codepoint  codepoint in hex  UTF-8        UTF-16
a     LATIN SMALL LETTER A      97         61                61           00 61
α     GREEK SMALL LETTER ALPHA  945        3b1               CE B1        03 B1
😂    FACE WITH TEARS OF JOY    128514     1f602             F0 9F 98 82  D8 3D DE 02

(It's not called “character ID”, because some “characters” are not really characters, such as space, return, tab, left-to-right mark, etc.)
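
The values in the table can be checked with a short script. This is a sketch that assumes TextEncoder is available (modern browsers and Node.js provide it). The names describeChar and hex2 are made up for the example. It prints UTF-16 as 16-bit code units rather than as separate bytes.

```javascript
// Assumption: TextEncoder exists in this runtime (browsers, Node.js).
const utf8Encoder = new TextEncoder();

// format a byte as 2-digit uppercase hex
const hex2 = (n) => n.toString(16).toUpperCase().padStart(2, "0");

// Hypothetical helper: describe one character like a row of the table
const describeChar = (ch) => {
    const utf16 = [];
    for (let i = 0; i < ch.length; i++) {
        // one entry per 16-bit code unit, as 4 hex digits
        utf16.push(ch.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
    }
    return {
        codepoint: ch.codePointAt(0),
        hex: ch.codePointAt(0).toString(16),
        utf8: Array.from(utf8Encoder.encode(ch), (b) => hex2(b)).join(" "),
        utf16: utf16.join(" "),
    };
};

for (const ch of ["a", "α", "😂"]) {
    console.log(ch, describeChar(ch));
}
// for 😂: codepoint 128514, hex 1f602, utf8 "F0 9F 98 82", utf16 "D83D DE02"
```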

[see Unicode Basics: Character Set, Encoding, UTF-8]

[see JS: Convert Character To/From Codepoint]

What Characters Create Problems?

Characters whose codepoint is ≥ 2^16 are emoji and other less frequently used characters, e.g., rarely used Chinese characters and characters of ancient languages.

[see Unicode Emoji 😄]

Characters whose codepoint is less than 2^16 are those whose codepoint in hexadecimal can be written with 4 digits or fewer.
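
To check whether a given string contains any such character, one possible approach is to test each character's codepoint against 2^16. The function name hasLargeCodepoint is made up for this sketch.

```javascript
// Hypothetical helper: true if str contains any character
// whose codepoint is ≥ 2^16 (hex 10000, more than 4 hex digits).
const hasLargeCodepoint = (str) => {
    for (const ch of str) {
        if (ch.codePointAt(0) >= 0x10000) {
            return true;
        }
    }
    return false;
};

console.log(hasLargeCodepoint("abc α")); // false
console.log(hasLargeCodepoint("😂X"));   // true

// 😂 needs 5 hex digits
console.log((128514).toString(16)); // 1f602
```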

More Bad Examples

Here's an example where the “slice” method gives an unexpected result.

// we want to take the second char, the capital X

console.log("😂X".slice(1)); // prints �X
// wrong, because index 1 is the second code unit of 😂

// 😂
// name: FACE WITH TEARS OF JOY
// codepoint in decimal: 128514

[see JS: String.prototype.slice]
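
One way to get the intended result in the slice example is to split the string into an array of characters first, then work on the array. This sketch uses Array.from, which splits a string the same way the for-of loop does. The variable names are made up for the example.

```javascript
// split by character, not by code unit
const charList = Array.from("😂X");

console.log(charList.length); // 2
console.log(charList[1]); // X

// slice on the array, then join back to a string
const fromSecondChar = charList.slice(1).join("");
console.log(fromSecondChar); // X
```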

How to Go Thru Character (Not Code Unit)

If you have a string that may contain a character with codepoint ≥ 2^16, the result of a string method may not be what you think it is.

What you can do is use the for-of loop (ES2015) to go thru the string. The for-of loop goes thru a string by character, not by 16-bit unit.

[see JS: for-of Loop]
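
Here is a small comparison of looping by index and looping with for-of on the same string. The variable names are made up for the example.

```javascript
const sampleStr = "a😂";

// by index: one step per code unit
const byCodeUnit = [];
for (let i = 0; i < sampleStr.length; i++) {
    byCodeUnit.push(sampleStr[i]);
}

// for-of: one step per character
const byChar = [];
for (const ch of sampleStr) {
    byChar.push(ch);
}

console.log(byCodeUnit.length); // 3
console.log(byChar.length); // 2
console.log(byChar); // [ 'a', '😂' ]
```

Note that for-of steps by codepoint, so something that looks like one character but is made of several codepoints (for example, a letter followed by a combining accent) still takes several steps.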

Real Length: Number of Characters in String

/* [
xah_string_real_length(str)
 return number of chars in string str.
 http://xahlee.info/js/js_string_byte_sequence.html
 version 2018-06-17
 ] */
const xah_string_real_length = (str) => {
    let i = 0;
    // for-of goes thru the string by character, not by code unit
    for (const c of str) {
        i += 1;
    }
    return i;
};

// --------------------------------------------------
// test
console.log ( xah_string_real_length("😂") === 1 ); // true

[see Unicode Basics: Character Set, Encoding, UTF-8]

Character (String) to Codepoint (Integer)

JS: String.prototype.codePointAt
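
A short sketch of codePointAt. The argument is still a code unit index, so asking at index 1 of a 2-unit character returns the second code unit, not a character's codepoint. The variable names are made up for the example.

```javascript
const cpSample = "😂X";

const cpAt0 = cpSample.codePointAt(0);
const cpAt1 = cpSample.codePointAt(1);
const cpAt2 = cpSample.codePointAt(2);

console.log(cpAt0); // 128514 (😂)
console.log(cpAt1); // 56834 (hex de02, second code unit of 😂)
console.log(cpAt2); // 88 (X)
```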

Codepoint (Integer) to Character (String)

JS: String.fromCodePoint
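
A short sketch of String.fromCodePoint. It accepts one or more codepoints, including those ≥ 2^16. The variable names are made up for the example.

```javascript
const joy = String.fromCodePoint(128514);
console.log(joy); // 😂

// hex form of the same codepoint
console.log(String.fromCodePoint(0x1f602) === "😂"); // true

// several codepoints at once
const three = String.fromCodePoint(97, 945, 128514);
console.log(three); // aα😂
```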

Code Unit (Integer) to Character (String)

JS: String.fromCharCode
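
A short sketch of String.fromCharCode. Each argument is treated as a 16-bit code unit, so a character with codepoint ≥ 2^16 needs both of its code units. The variable names are made up for the example.

```javascript
console.log(String.fromCharCode(97)); // a

// 😂 built from its 2 code units
const joyFromUnits = String.fromCharCode(0xD83D, 0xDE02);
console.log(joyFromUnits === "😂"); // true

// a value above 16 bits is truncated to its low 16 bits,
// so passing a large codepoint does not give the character
const truncated = String.fromCharCode(128514);
console.log(truncated === "😂"); // false
console.log(truncated === String.fromCharCode(0xF602)); // true
```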

Character (String) to Code Unit (As String of Length 1)

JS: String.prototype.charAt
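
A short sketch of charAt. It returns the code unit at an index as a string of length 1, which may be only half of a character. The variable names are made up for the example.

```javascript
const charAtSample = "😂X";

console.log(charAtSample.charAt(2)); // X

// index 0 is only the first code unit of 😂
const firstUnit = charAtSample.charAt(0);
console.log(firstUnit.length); // 1
console.log(firstUnit === "\uD83D"); // true

// the 2 code units together make the character
const rejoined = charAtSample.charAt(0) + charAtSample.charAt(1);
console.log(rejoined === "😂"); // true
```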

Character (String) to Code Unit (As Integer)

JS: String.prototype.charCodeAt
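
A short sketch of charCodeAt. It returns the code unit at an index as an integer. The variable names are made up for the example.

```javascript
console.log("a".charCodeAt(0)); // 97

// the 2 code units of 😂, shown in hex
const unit0 = "😂".charCodeAt(0);
const unit1 = "😂".charCodeAt(1);
console.log(unit0.toString(16)); // d83d
console.log(unit1.toString(16)); // de02
```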
