Unicode: UTF-16 Encoding
What is UTF-16
UTF-16 stands for Unicode Transformation Format, 16-bit.
[see Unicode: Character Set, Encoding, UTF-8, Codepoint]
- UTF-16 encoding was first published in 1996.
- UTF-16 encoding is extended from UCS-2 encoding.
- UCS-2 was published in 1990.
- UCS-2 is a fixed-length encoding. Each codepoint is 2 bytes. UTF-16 extended this into a variable-length encoding, by specifying that for characters whose Codepoint is greater than or equal to 2^16, an additional 2 bytes are required.
UTF-16 was very common in the 1990s and 2000s.
- UTF-16 is used by Java as the basis of its string type.
- UTF-16 is used by Microsoft Windows's NTFS file system for file names.
- UTF-16 is used by Apple's HFS Plus file system (1998 to 2017) for file names.
But by the 2010s, most computing tech had converged on Unicode: UTF-8 Encoding as the standard.
UTF-16 Encoding Scheme
UTF-16 is a variable-width character encoding scheme. Each Unicode Codepoint is encoded into 2 bytes or 4 bytes. (a byte is 8 bits here.)
A Code Unit is a 2-byte (16-bit) unit in UTF-16.
Codepoints less than 2^16 are encoded as one 16-bit code unit equal to the numerical value of the codepoint.
Codepoints greater than or equal to 2^16 are encoded as two 16-bit code units, called a surrogate pair.
These two 16-bit code units are chosen from the so-called surrogate range of D800 to DFFF codepoint values. (these values have no characters assigned, so they cannot be confused with single code units.) [see Binary/Hexadecimal Converter]
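The scheme above can be sketched in code. Here is a minimal Python function (the function name is mine, not part of any standard) that computes the UTF-16 code units for a codepoint, using the standard surrogate pair formula: subtract 0x10000, then split the remaining 20 bits between a high surrogate (D800 + top 10 bits) and a low surrogate (DC00 + bottom 10 bits).

```python
def utf16_code_units(cp):
    """Return the UTF-16 code units (as ints) for a Unicode codepoint."""
    if cp < 0x10000:
        # BMP character: one 16-bit code unit, equal to the codepoint itself
        return [cp]
    u = cp - 0x10000                 # 20-bit value, 0 to 0xFFFFF
    high = 0xD800 + (u >> 10)        # high surrogate: top 10 bits
    low = 0xDC00 + (u & 0x3FF)       # low surrogate: bottom 10 bits
    return [high, low]

# "A" (U+0041) is in the BMP: one code unit
print([hex(u) for u in utf16_code_units(ord("A"))])    # ['0x41']

# 😀 (U+1F600) is above 2^16: a surrogate pair
print([hex(u) for u in utf16_code_units(0x1F600)])     # ['0xd83d', '0xde00']
```

You can cross-check against Python's built-in encoder: `chr(0x1F600).encode("utf-16-be")` gives the bytes `d8 3d de 00`, matching the pair above.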
Unicode: Surrogate Pair