Unicode: UTF-8 Encoding
What is UTF-8
UTF-8 stands for Unicode Transformation Format, 8-bit.
[see Unicode: Character Set, Encoding, UTF-8, Codepoint]
UTF-8 is backward compatible with ASCII encoding. Each ASCII character has the same encoding in UTF-8. This means a file that contains only ASCII characters has the same byte sequence whether it is encoded in ASCII or UTF-8 (except when the UTF-8 file begins with a Byte Order Mark).
UTF-8 was published in 1993. [see Unicode UTF8 History, by Rob Pike]
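As a quick check of the ASCII compatibility claim, here is a minimal Python sketch. The sample string is an arbitrary choice. It relies only on the standard `str.encode` method and the `codecs.BOM_UTF8` constant.

```python
import codecs

# Arbitrary sample text that contains only ASCII characters.
text = "Hello, UTF-8"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure ASCII text, both encodings give the same byte sequence.
same_bytes = ascii_bytes == utf8_bytes

# Python's "utf-8-sig" codec writes a Byte Order Mark first,
# so the result is no longer identical to the ASCII bytes.
with_bom = text.encode("utf-8-sig")
bom_prefixed = with_bom == codecs.BOM_UTF8 + utf8_bytes

print(same_bytes, bom_prefixed)
```

The Byte Order Mark is the 3 bytes EF BB BF, which is why a file that starts with it no longer matches its plain ASCII form.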
UTF-8 is the most popular encoding, used by 98% of all web pages in the world as of 2022. [see How Popular is Unicode UTF-8]
UTF-8 Encoding Scheme
UTF-8 is a variable-width character encoding scheme. Each Unicode Codepoint is encoded as 1 to 4 Bytes.
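|First codepoint||Last codepoint||Byte 1||Byte 2||Byte 3||Byte 4|
|U+0000||U+007F||0xxxxxxx|
|U+0080||U+07FF||110xxxxx||10xxxxxx|
|U+0800||U+FFFF||1110xxxx||10xxxxxx||10xxxxxx|
|U+10000||U+10FFFF||11110xxx||10xxxxxx||10xxxxxx||10xxxxxx|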
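To make the 1 to 4 byte scheme concrete, here is a minimal sketch of an encoder written by hand in Python. It assumes the standard UTF-8 bit layouts, and it rejects surrogate codepoints, as the built-in codec does. The function name `utf8_encode` is made up for illustration. In real code, use `str.encode("utf-8")`.

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one Unicode codepoint to UTF-8 by hand (illustrative helper)."""
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("codepoint out of Unicode range")
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogate codepoints are not encodable in UTF-8")

    low6 = 0b111111  # mask for the 6 payload bits of a continuation byte

    if codepoint <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([codepoint])
    if codepoint <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (codepoint >> 6),
                      0b10000000 | (codepoint & low6)])
    if codepoint <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (codepoint >> 12),
                      0b10000000 | ((codepoint >> 6) & low6),
                      0b10000000 | (codepoint & low6)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (codepoint >> 18),
                  0b10000000 | ((codepoint >> 12) & low6),
                  0b10000000 | ((codepoint >> 6) & low6),
                  0b10000000 | (codepoint & low6)])


# Compare with Python's built-in encoder, one sample per byte length:
# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (emoji).
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(hex(cp), utf8_encode(cp).hex(" "), chr(cp).encode("utf-8").hex(" "))
```

Each branch fills the x positions of one byte layout with the bits of the codepoint, highest bits first.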
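In this scheme:
- If a byte begins with bit 0, the character is a single byte (an ASCII character). Otherwise the byte is part of a multi-byte character.
- If a byte begins with bits 11, it is the first byte of a multi-byte character, and the count of leading 1 bits (2 to 4) is the number of bytes in that character.
- If a byte begins with bits 10, it is a continuation byte. Any byte that does not begin with bits 10 is the start of a character.
- You can find the next or previous starting byte by searching for a byte that does not begin with bits 10.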
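The rules in this list can be sketched in Python as follows. The helper names (`is_continuation`, `sequence_length`, `char_start`, `next_char_start`) are made up for illustration, and the sketch assumes the input is valid UTF-8.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte begins with bits 10."""
    return byte >> 6 == 0b10


def sequence_length(first_byte: int) -> int:
    """Number of bytes in the character that starts with first_byte."""
    if first_byte >> 7 == 0b0:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a valid starting byte")


def char_start(data: bytes, index: int) -> int:
    """Step backward from index to the starting byte of its character."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


def next_char_start(data: bytes, index: int) -> int:
    """Step forward from index to the next starting byte (or len(data))."""
    index += 1
    while index < len(data) and is_continuation(data[index]):
        index += 1
    return index


# "a" (1 byte), U+20AC euro sign (3 bytes), U+1F600 emoji (4 bytes).
data = "a\u20ac\U0001F600".encode("utf-8")

print(char_start(data, 3))            # index 3 is inside the euro sign
print(next_char_start(data, 1))       # from the euro sign to the emoji
print(sequence_length(data[4]))       # byte count of the emoji
```

Because a starting byte never looks like a continuation byte, a reader that lands in the middle of a byte stream needs to skip at most 3 bytes to resynchronize.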