Unicode: UTF-8 Encoding
What is UTF-8
UTF-8 stands for Unicode Transformation Format, 8-bit.
[see Unicode Basics: Character Set, Encoding, UTF-8, Codepoint]
UTF-8 is backward compatible with ASCII. Each ASCII character is encoded in UTF-8 as the same single byte. This means a file that contains only ASCII characters has the same byte sequence whether it is encoded as ASCII or as UTF-8 (except when the UTF-8 file begins with a Byte Order Mark).
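The ASCII compatibility can be checked with a small Python sketch. The sample string is arbitrary, and `utf-8-sig` is Python's name for the codec that writes a Byte Order Mark at the start of the output.

```python
text = "Hello, UTF-8"  # arbitrary sample containing only ASCII characters

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For ASCII-only text, both encodings give the same byte sequence.
print(ascii_bytes == utf8_bytes)  # True

# With a Byte Order Mark, three extra bytes (EF BB BF) come first,
# so the file is no longer byte-for-byte identical to the ASCII one.
with_bom = text.encode("utf-8-sig")
print(with_bom[:3].hex())  # efbbbf
print(with_bom[3:] == ascii_bytes)  # True
```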
UTF-8 was published in 1993. [see Unicode UTF8 History, by Rob Pike]
UTF-8 is the most popular character encoding, used by 98% of all web pages as of 2022. [see How Popular is Unicode UTF-8]
UTF-8 Encoding Scheme
UTF-8 is a variable-width character encoding scheme. Each Unicode codepoint is encoded as 1 to 4 bytes.
First codepoint | Last codepoint | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|
U+0000 | U+007F | 0xxxxxxx | | | |
U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
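The table can be turned into code almost row by row. The following is a minimal Python sketch, not a reference implementation. The function name `utf8_encode` is made up for illustration. The surrogate check is not shown in the table; it is added here because UTF-8 does not allow codepoints U+D800 to U+DFFF to be encoded.

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one codepoint by hand, one branch per table row."""
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("codepoint outside the Unicode range")
    # Not in the table: surrogates are reserved for UTF-16.
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogate codepoints cannot be encoded in UTF-8")

    if codepoint <= 0x7F:  # 0xxxxxxx
        return bytes([codepoint])
    if codepoint <= 0x7FF:  # 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (codepoint >> 6),
            0b10000000 | (codepoint & 0b111111),
        ])
    if codepoint <= 0xFFFF:  # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (codepoint >> 12),
            0b10000000 | ((codepoint >> 6) & 0b111111),
            0b10000000 | (codepoint & 0b111111),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0b11110000 | (codepoint >> 18),
        0b10000000 | ((codepoint >> 12) & 0b111111),
        0b10000000 | ((codepoint >> 6) & 0b111111),
        0b10000000 | (codepoint & 0b111111),
    ])


# One sample character per table row, compared with Python's own encoder.
for ch in "Aé€😀":
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", encoded.hex())
```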
In this scheme:
- If a byte begins with bit 0, it is a complete single-byte character. Otherwise, it is part of a multi-byte character.
- If a byte begins with bits 11, it is the first byte of a multi-byte character, and the count of leading 1 bits indicates how many bytes the character has.
- A byte is the start of a character if it does not begin with bits 10. Bytes that begin with bits 10 are continuation bytes.
- You can find the next or previous starting byte by searching for a byte that does not begin with bits 10.
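These rules can be sketched in Python as follows. The function names `is_start_byte`, `sequence_length`, and `char_start` are made up for illustration, and the code assumes the input is already valid UTF-8. It does no validation beyond rejecting a byte that cannot start a character.

```python
def is_start_byte(b: int) -> bool:
    """True if the byte does not begin with bits 10."""
    return (b & 0b11000000) != 0b10000000


def sequence_length(first_byte: int) -> int:
    """Number of bytes in the character, read from its first byte."""
    if (first_byte & 0b10000000) == 0:  # 0xxxxxxx
        return 1
    if (first_byte & 0b11100000) == 0b11000000:  # 110xxxxx
        return 2
    if (first_byte & 0b11110000) == 0b11100000:  # 1110xxxx
        return 3
    if (first_byte & 0b11111000) == 0b11110000:  # 11110xxx
        return 4
    raise ValueError("not a valid UTF-8 start byte")


def char_start(data: bytes, i: int) -> int:
    """Index of the first byte of the character that contains byte i."""
    while i > 0 and not is_start_byte(data[i]):
        i -= 1  # step back over continuation bytes
    return i


data = "a€b".encode("utf-8")  # bytes: 61 e2 82 ac 62

start = char_start(data, 3)  # index 3 is the last byte of "€"
length = sequence_length(data[start])
print(start, length)  # 1 3
print(data[start:start + length].decode("utf-8"))  # €
```

Because a start byte never looks like a continuation byte, the search needs at most three steps backward in valid UTF-8.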