JavaScript: String is 16-Bit Unit Sequence

By Xah Lee.

A JavaScript string is a sequence of 16-bit values that represent characters in the UTF-16 encoding. ECMAScript 2015 §ECMAScript Data Types and Values#sec-ecmascript-language-types-string-type

An element of a string is a 16-bit unit, not necessarily a character.

If the string contains a character whose code point is ≥ 2^16, the result of indexing or measuring the string may not be what you expect.

Characters whose code point is ≥ 2^16 are those that lie outside the Unicode Basic Multilingual Plane, such as the emoji 😸. 〔►see Unicode Emoji 😄 😱 👽〕

Here's an example that shows the difference.

// js strings are sequences of 16-bit values, not characters

// GRINNING CAT FACE WITH SMILING EYES, 128568, U+1F638
console.log("😸".length); // 2

// GREEK SMALL LETTER ALPHA, 945, U+03B1
console.log("α".length); // 1

Here is an example with the “slice” method.

var aa = "αX"; // GREEK SMALL LETTER ALPHA, 945, U+03B1
var bb = "😸X"; // GRINNING CAT FACE WITH SMILING EYES, 128568, U+1F638

// we want to take the first char
console.log(aa.slice(0, 1)); // α
console.log(bb.slice(0, 1)); // � WRONG! (only the first 16-bit unit of 😸)
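
One way to take the first character safely is to lean on the string iterator, which yields whole code points instead of 16-bit units. The following is a minimal sketch. The helper name firstChar is hypothetical and is not part of JavaScript or of this article's original code.

```javascript
// Hypothetical helper (an illustration, not a built-in):
// returns the first character of a string by code point,
// not by 16-bit unit.
function firstChar(str) {
    // The string iterator yields whole code points, so a character
    // with code point ≥ 2^16 comes out as one element.
    for (const ch of str) {
        return ch;
    }
    return ""; // an empty string has no first character
}

console.log(firstChar("αX")); // α
console.log(firstChar("😸X")); // 😸
console.log(firstChar("😸X").length); // 2
```

Note that the returned string for 😸 still has length 2, because length counts 16-bit units.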

JavaScript String Unit Explained

  1. JavaScript strings and characters are based on the Unicode standard, version 5.1 or later.
  2. Unicode assigns an integer ID, called “code point”, to each character.
  3. Unicode specifies several encodings. The most popular ones are UTF-8 and UTF-16.
  4. An encoding is a standard that translates a character (its code point) into a sequence of bits.
  5. JavaScript strings are encoded using UTF-16, by spec.
  6. UTF-16 encodes each character as either 2 bytes or 4 bytes. (1 byte is 8 bits.)
  7. For characters whose code point is less than 2^16, the encoding of that character in UTF-16 is 16 bits (2 bytes).
  8. Characters that have code point ≥ 2^16 are less common: rarely used Chinese characters, characters of ancient languages, and emoji such as 😸 (U+1F638, code point 128568 in decimal).
  9. When a string contains a character whose code point is ≥ 2^16, that character's encoding is 4 bytes: two 16-bit units, called a surrogate pair. For example, the character 𐀀 (U+10000: LINEAR B SYLLABLE B008 A) has code point 65536, which is exactly 2^16. Its encoding in UTF-16 is the two 16-bit units xD800 xDC00, so "𐀀".length is 2.
  10. JavaScript defines an “element” of a string as a 16-bit value of the string's encoding. That is, index 0 is the first 16-bit unit, index 1 is the second 16-bit unit, etc. This means that when a string contains a character whose code point is ≥ 2^16, the result of an index-based string method may be unexpected, because the index does not correspond to a character.
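
The way UTF-16 turns a code point ≥ 2^16 into two 16-bit units can be sketched in a few lines. This is only an illustration of the arithmetic defined by UTF-16. The helper names toSurrogatePair and hex are hypothetical. charCodeAt and codePointAt are standard string methods.

```javascript
// Hypothetical helper: compute the two 16-bit units (surrogate pair)
// that UTF-16 uses for a code point ≥ 2^16 (0x10000).
function toSurrogatePair(codePoint) {
    const offset = codePoint - 0x10000;    // a 20-bit value
    const high = 0xD800 + (offset >> 10);  // top 10 bits
    const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
    return [high, low];
}

// Hypothetical helper: format a number as uppercase hexadecimal
const hex = (n) => n.toString(16).toUpperCase();

console.log(toSurrogatePair(0x10000).map(hex).join(" ")); // D800 DC00
console.log(toSurrogatePair(0x1F638).map(hex).join(" ")); // D83D DE38

// The string index sees the same two units
const cat = "😸";
console.log(hex(cat.charCodeAt(0)), hex(cat.charCodeAt(1))); // D83D DE38

// codePointAt reads the whole pair starting at index 0
console.log(hex(cat.codePointAt(0))); // 1F638
```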

If you have a string that may contain any character with code point ≥ 2^16, the result of any string method may not be what you think it is.

What you can do is use the (ES2015) for-of loop to go thru the string. The for-of loop goes thru a string by character (code point), not by 16-bit unit.

〔►see JavaScript: for-of Loop〕
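
To see the difference side by side, here is a small sketch that goes thru the same string twice, once by index and once with for-of. The variable names s, units, and chars are made up for this illustration.

```javascript
const s = "a😸b";

// index loop: visits each 16-bit unit
const units = [];
for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i).toString(16));
}
console.log(units.join(" ")); // 61 d83d de38 62

// for-of loop: visits each code point
const chars = [];
for (const ch of s) {
    chars.push(ch);
}
console.log(chars.join(" ")); // a 😸 b

// 4 units, but 3 characters
console.log(s.length, chars.length); // 4 3
```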

Real Length: Number of Characters in String

// function that returns the number of chars (code points) in a string
const stringRealLength = (str) => {
    let count = 0;
    // for-of goes thru the string by code point, not by 16-bit unit
    for (const _ of str) {
        count += 1;
    }
    return count;
};

// test
console.log ( stringRealLength("😸") ); // 1

// GRINNING CAT FACE WITH SMILING EYES, 128568, U+1F638
console.log("😸".length); // 2

ECMAScript 2015 §ECMAScript Language: Source Code#sec-ecmascript-language-source-code

〔►see Unicode Basics: What's Character Set, Character Encoding, UTF-8?〕
