Unicode: UTF-8 Encoding

By Xah Lee. Date: 2022-10-16. Last updated: 2022-10-24.

What is UTF-8

UTF-8 stands for Unicode Transformation Format, 8-bit.

[see Unicode: Character Set, Encoding, UTF-8, Codepoint]

UTF-8 is compatible with ASCII. Each ASCII character has the same encoding in UTF-8. This means a file that contains only ASCII characters has the same byte sequence whether it is encoded as ASCII or as UTF-8 (unless the UTF-8 file begins with a Byte Order Mark).
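As a quick check of the ASCII compatibility, here is a minimal Python sketch. The sample string is made up for illustration. The "utf-8-sig" codec is Python's name for UTF-8 with a leading Byte Order Mark.

```python
# Sample text containing only ASCII characters (illustrative).
text = "Hello, UTF-8!"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# The two encodings produce the same byte sequence.
print(ascii_bytes == utf8_bytes)  # True

# With a Byte Order Mark, 3 extra bytes (EF BB BF) come first.
with_bom = text.encode("utf-8-sig")
print(with_bom[:3].hex())  # efbbbf
```

After the 3 BOM bytes, the rest of the byte sequence is again identical to the ASCII one.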

UTF-8 was published in 1993. [see Unicode UTF8 History, by Rob Pike]

UTF-8 is the most popular encoding, used by 98% of all web pages in the world as of 2022. [see How Popular is Unicode UTF-8]

UTF-8 Encoding Scheme

UTF-8 is a variable-width character encoding scheme. Each Unicode codepoint is encoded as 1 to 4 bytes.

Codepoint to UTF-8 conversion
First codepoint	Last codepoint	Byte 1	Byte 2	Byte 3	Byte 4
U+0000	U+007F	0xxxxxxx
U+0080	U+07FF	110xxxxx	10xxxxxx
U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
U+10000	U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx
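The table can be turned into code almost row by row. The following is a minimal sketch in Python, not a production encoder. The function name encode_codepoint is made up for illustration. It also rejects the surrogate range U+D800 to U+DFFF, which the table does not show but which UTF-8 does not allow.

```python
def encode_codepoint(cp: int) -> bytes:
    """Encode one Unicode codepoint to UTF-8, one branch per table row."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("codepoint out of Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and are not valid in UTF-8.
        raise ValueError("surrogate codepoints cannot be encoded")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])


# One sample character for each row of the table.
for ch in ["A", "é", "€", "😀"]:
    print(f"U+{ord(ch):04X}", encode_codepoint(ord(ch)).hex(" "))
# U+0041 41
# U+00E9 c3 a9
# U+20AC e2 82 ac
# U+1F600 f0 9f 98 80
```

The results match Python's built-in str.encode("utf-8") for the same characters.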

In this scheme:

If a byte begins with bit 0, the character is a single byte. Otherwise the byte is part of a multi-byte character.
If a byte begins with bits 11, it is the first byte of a multi-byte character, and the count of leading 1 bits is the number of bytes in the character.
A byte is the start of a character if it does not begin with bits 10. Bytes that begin with bits 10 are continuation bytes.
You can find the next or previous starting byte by searching for a byte that does not begin with bits 10.
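The rules above can be sketched in Python as follows. The function names sequence_length, is_continuation, and char_start are made up for illustration, and the sketch assumes the input is valid UTF-8.

```python
def is_continuation(b: int) -> bool:
    """True if the byte begins with bits 10."""
    return b & 0b11000000 == 0b10000000


def sequence_length(lead: int) -> int:
    """Number of bytes in the character, read from its first byte."""
    if lead & 0b10000000 == 0:
        return 1  # 0xxxxxxx
    if lead & 0b11100000 == 0b11000000:
        return 2  # 110xxxxx
    if lead & 0b11110000 == 0b11100000:
        return 3  # 1110xxxx
    if lead & 0b11111000 == 0b11110000:
        return 4  # 11110xxx
    raise ValueError("not a starting byte")


def char_start(data: bytes, i: int) -> int:
    """Index of the starting byte of the character that covers index i."""
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i


data = "a€😀".encode("utf-8")  # 1 + 3 + 4 = 8 bytes

print(char_start(data, 3))       # 1, index 3 is the last byte of €
print(char_start(data, 6))       # 4, index 6 is inside 😀
print(sequence_length(data[4]))  # 4
```

Because a starting byte never begins with bits 10, a program can jump to any byte offset and step back at most 3 bytes to land on a character boundary.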
