Unicode: UTF-16 Encoding

By Xah Lee. Date: 2022-10-16. Last updated: 2022-10-21.

What is UTF-16

UTF-16 stands for Unicode Transformation Format, 16-bit.

〔see Unicode: Character Set, Encoding, UTF-8, Codepoint〕

UTF-16 encoding was first published in 1996.
UTF-16 encoding is extended from UCS-2 encoding.
UCS-2 was published in 1990.
UCS-2 is a fixed-length encoding. Each codepoint is 2 bytes. UTF-16 extended this by making it variable-length: characters whose codepoint is 2^16 or greater take an additional 2 bytes.

UTF-16 was very common in the 1990s and 2000s.

UTF-16 is used by JavaScript as the basis of its string. 〔see JS: String Code Unit〕
UTF-16 is used by Java as the basis of its string.
UTF-16 is used by Microsoft Windows's NTFS file system for file names.
UTF-16 is used by Apple's HFS Plus file system (1998 to 2017) for file names.

But by the 2010s, most computing tech had converged on UTF-8 as the standard. 〔see Unicode: UTF-8 Encoding〕

UTF-16 Encoding Scheme

UTF-16 is a variable-width character encoding scheme. Each Unicode codepoint is encoded into 2 bytes or 4 bytes. (A byte is 8 bits here.)
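
The 2-or-4-byte rule can be checked with a minimal Python sketch. It assumes Python's built-in "utf-16-be" codec (big-endian UTF-16, no byte order mark); the sample characters are arbitrary picks, not from any standard test set.

```python
# Count how many bytes each sample character takes in UTF-16.
# "utf-16-be" is big-endian UTF-16 without a byte order mark,
# so the byte count is not inflated by a 2-byte BOM.
samples = ["a", "α", "中", "😀"]

byte_counts = {ch: len(ch.encode("utf-16-be")) for ch in samples}

for ch, n in byte_counts.items():
    print(f"U+{ord(ch):04X} {ch} takes {n} bytes")
```

The first three samples are below 2^16 and take 2 bytes each. The last one, U+1F600, is above 2^16 and takes 4 bytes.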

A Code Unit is a 2-byte (16-bit) unit in UTF-16.

Codepoints less than 2^16 are encoded with one 16-bit code unit equal to the numerical value of the codepoint.
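
As a small illustration of the single code unit case, the following Python sketch encodes one character below 2^16 and reads the 2 bytes back as a number. The sample character is an arbitrary choice, and big-endian byte order is assumed so the bytes read in the same order as the codepoint's hex digits.

```python
ch = "α"  # U+03B1, an arbitrary sample below 2^16

# Encode to big-endian UTF-16 (no byte order mark): exactly 2 bytes.
encoded = ch.encode("utf-16-be")

# Read those 2 bytes back as one 16-bit number, the code unit.
code_unit = int.from_bytes(encoded, "big")

print(f"codepoint {ord(ch):04X}, code unit {code_unit:04X}")
```

Both numbers come out the same, 03B1.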

Codepoints greater than or equal to 2^16 are encoded into two 16-bit code units. The first code unit is in the range D800 to DBFF (high surrogate), and the second is in the range DC00 to DFFF (low surrogate).

These two 16-bit code units are chosen from the so-called surrogate range, codepoint values D800 to DFFF. (They do not have characters assigned.) 〔see Binary / Hexadecimal Converter〕
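
The two code units can be computed by hand. Subtract 2^16 (hexadecimal 10000) from the codepoint, which leaves a 20-bit number. The upper 10 bits are added to D800 to get the first code unit, and the lower 10 bits are added to DC00 to get the second. Here is a minimal sketch in Python. The function name `utf16_code_units` is made up for this example, and the result is checked against Python's built-in "utf-16-be" codec.

```python
def utf16_code_units(codepoint):
    """Return the list of UTF-16 code units for one codepoint.

    Hypothetical helper for illustration, not a library function.
    """
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("not a Unicode codepoint")
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if codepoint < 0x10000:
        # Below 2^16: one code unit, equal to the codepoint.
        return [codepoint]
    offset = codepoint - 0x10000        # 20-bit number
    high = 0xD800 + (offset >> 10)      # upper 10 bits, D800 to DBFF
    low = 0xDC00 + (offset & 0x3FF)     # lower 10 bits, DC00 to DFFF
    return [high, low]


units = utf16_code_units(0x1F600)  # 😀
print(" ".join(f"{u:04X}" for u in units))  # prints D83D DE00

# Cross-check against the built-in codec (big-endian, no byte order mark).
builtin_bytes = "\U0001F600".encode("utf-16-be")
manual_bytes = b"".join(u.to_bytes(2, "big") for u in units)
print(builtin_bytes == manual_bytes)  # prints True
```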

Unicode: Surrogate Pair
