Unicode: UTF-16 Encoding

By Xah Lee. Date: . Last updated: .

What is UTF-16

UTF-16 stands for Unicode Transformation Format, 16-bit.

[see Unicode Basics: Character Set, Encoding, UTF-8, Codepoint]

UTF-16 was very common in the 1990s and 2000s.

But by the 2010s, most computing tech had converged on Unicode: UTF-8 Encoding as the standard.

UTF-16 Encoding Scheme

UTF-16 is a variable-width character encoding scheme. Each Codepoint of Unicode is encoded into 2 bytes or 4 bytes. (A byte here is 8 bits.)

A Code Unit in UTF-16 is a 2-byte (16-bit) unit.

Codepoints less than 2^16 are encoded as one 16-bit code unit, equal to the numerical value of the codepoint.
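For example, this can be checked in Python (an illustration, not part of the original article):

```python
# For a codepoint below 2^16, UTF-16 stores the codepoint value
# directly as one 16-bit code unit.
cp = ord("λ")                          # U+03BB GREEK SMALL LETTER LAMDA
encoded = "λ".encode("utf-16-be")      # big-endian UTF-16, no byte order mark
print(hex(cp))                         # 0x3bb
print(encoded.hex())                   # 03bb — same value as the codepoint
```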

Codepoints greater than or equal to 2^16 are encoded into two 16-bit code units, called a surrogate pair. The first code unit is called the high surrogate, the second the low surrogate.

These two 16-bit code units are chosen from the so-called surrogate range of D800 to DFFF. (Those codepoint values have no character assigned.) [see Binary/Hexadecimal Converter]
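The surrogate pair computation can be sketched in Python. (The function name surrogate_pair is just for illustration.)

```python
# Encode a codepoint ≥ 2^16 into a UTF-16 surrogate pair.
def surrogate_pair(codepoint):
    assert codepoint >= 0x10000
    v = codepoint - 0x10000        # remainder fits in 20 bits
    high = 0xD800 + (v >> 10)      # top 10 bits → high surrogate
    low = 0xDC00 + (v & 0x3FF)     # low 10 bits → low surrogate
    return (high, low)

# U+1F600 GRINNING FACE 😀
print([hex(u) for u in surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```

The result matches Python's own encoder: "😀".encode("utf-16-be") gives bytes D8 3D DE 00.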

Unicode: Surrogate Pair

Unicode and Encoding Explained






How To



Unicode for Programmers