Unicode: UTF-8 Encoding

By Xah Lee.

What is UTF-8

UTF-8 stands for Unicode Transformation Format, 8-bit.

[see Unicode: Character Set, Encoding, UTF-8, Codepoint]

UTF-8 is compatible with ASCII. Each ASCII character is encoded in UTF-8 as the same single byte it has in ASCII. This means a file that contains only ASCII characters has the same byte sequence whether it is encoded as ASCII or as UTF-8 (except when the UTF-8 file begins with a Byte Order Mark).
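This can be checked quickly. The following is a minimal sketch in Python that uses only the built-in `str.encode` method and the standard `codecs` module; the sample string is an arbitrary choice for illustration.

```python
import codecs

text = "Hello, world!"  # arbitrary sample containing only ASCII characters

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure-ASCII text, both encodings produce identical bytes.
print(ascii_bytes == utf8_bytes)

# Python's "utf-8-sig" codec prepends the Byte Order Mark (bytes EF BB BF),
# so the result no longer matches the plain ASCII bytes.
bom_bytes = text.encode("utf-8-sig")
print(bom_bytes[:3] == codecs.BOM_UTF8)
print(bom_bytes[3:] == ascii_bytes)
```

All three comparisons print True: the only difference the Byte Order Mark makes is the three extra bytes at the start.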

UTF-8 was published in 1993. [see Unicode UTF8 History, by Rob Pike]

UTF-8 is the most popular encoding, used by 98% of all web pages in the world as of 2022. [see How Popular is Unicode UTF-8]

UTF-8 Encoding Scheme

UTF-8 is a variable-width character encoding scheme. Each Unicode codepoint is encoded as 1 to 4 bytes.
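To see the variable width in practice, here is a small Python sketch. The four sample characters are arbitrary picks, one for each possible byte length, and are written as escapes so the snippet does not depend on the file's own encoding.

```python
# Arbitrary sample characters, one for each UTF-8 length (1 to 4 bytes).
samples = [
    "A",           # U+0041 LATIN CAPITAL LETTER A
    "\u00e9",      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    "\u20ac",      # U+20AC EURO SIGN
    "\U0001F600",  # U+1F600 GRINNING FACE
]

for ch in samples:
    encoded = ch.encode("utf-8")
    # bytes.hex(sep) requires Python 3.8 or later.
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```

The byte counts come out as 1, 2, 3, and 4, growing with the codepoint value.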

Codepoint to UTF-8 conversion

First codepoint  Last codepoint  Byte 1    Byte 2    Byte 3    Byte 4
U+0000           U+007F          0xxxxxxx
U+0080           U+07FF          110xxxxx  10xxxxxx
U+0800           U+FFFF          1110xxxx  10xxxxxx  10xxxxxx
U+10000          U+10FFFF        11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

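In this scheme, the x positions hold the bits of the codepoint, most significant bits first. A first byte that starts with 0 is a complete one-byte character, the same as in ASCII. A first byte that starts with 110, 1110, or 11110 begins a sequence of 2, 3, or 4 bytes, and every following byte in that sequence starts with 10.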
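As a rough sketch of how the bit patterns in the conversion table can be applied by hand, the Python function below builds the bytes with shifts and masks. The name `encode_codepoint` is made up for this illustration; in real code, the built-in `str.encode("utf-8")` does this job. Note one simplification: the sketch does not reject surrogate codepoints (U+D800 to U+DFFF), which Python's own encoder refuses.

```python
def encode_codepoint(cp: int) -> bytes:
    """Encode one codepoint to UTF-8 by following the table's bit patterns.

    Hypothetical helper for illustration; it does not reject surrogates.
    """
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError(f"not a Unicode codepoint: {cp:#x}")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])


# Compare the hand-built bytes with Python's built-in encoder.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    manual = encode_codepoint(cp)
    builtin = chr(cp).encode("utf-8")
    print(f"U+{cp:04X}: {manual.hex(' ')}  matches built-in: {manual == builtin}")
```

Each branch corresponds to one row of the table: the range check picks the row, the first byte carries the length marker plus the highest bits, and each later byte carries the 10 marker plus the next 6 bits.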

Unicode and Encoding Explained