Unicode: UTF-8 Encoding
What is UTF-8
UTF-8 stands for Unicode Transformation Format, 8-bit.
[see Unicode: Character Set, Encoding, UTF-8, Codepoint]
UTF-8 is backward compatible with ASCII: each ASCII character is encoded as the same single byte in UTF-8. So a file that contains only ASCII characters has the same byte sequence whether it is encoded as ASCII or as UTF-8 (unless the UTF-8 file begins with a Byte Order Mark).
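As a quick check of the ASCII compatibility, here is a minimal Python sketch. The sample string is arbitrary, and `utf-8-sig` is Python's name for the codec that writes a leading Byte Order Mark.

```python
# ASCII-only text produces identical bytes under both encodings.
text = "Hello, world!"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
print(ascii_bytes == utf8_bytes)  # True

# With a Byte Order Mark, 3 extra bytes (EF BB BF) come first,
# so the file no longer matches the plain ASCII byte sequence.
bom_bytes = text.encode("utf-8-sig")
print(bom_bytes[:3].hex())           # efbbbf
print(bom_bytes[3:] == ascii_bytes)  # True
```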
UTF-8 was published in 1993. [see Unicode UTF8 History, by Rob Pike]
UTF-8 is the most popular encoding, used by 98% of all web pages as of 2022. [see How Popular is Unicode UTF-8]
UTF-8 Encoding Scheme
UTF-8 is a variable-width character encoding scheme. Each Unicode codepoint is encoded as 1 to 4 bytes.
First codepoint | Last codepoint | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|
U+0000 | U+007F | 0xxxxxxx | | | |
U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
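The table can be read as an algorithm: pick the row by codepoint range, then fill the `x` positions with the codepoint's bits, highest bits first. The following is a rough Python sketch of that reading, not a production encoder. The function name `utf8_encode` is made up for illustration, and the rejection of surrogate codepoints (U+D800 to U+DFFF) is an addition not shown in the table. Results are compared against Python's built-in encoder.

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one codepoint by hand, following the table's bit layout."""
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("codepoint out of Unicode range")
    if 0xD800 <= codepoint <= 0xDFFF:
        # Assumption added here: surrogates are not valid in UTF-8.
        raise ValueError("surrogate codepoints cannot be encoded")
    if codepoint <= 0x7F:    # 1 byte: 0xxxxxxx
        return bytes([codepoint])
    if codepoint <= 0x7FF:   # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (codepoint >> 6),
                      0b10000000 | (codepoint & 0b111111)])
    if codepoint <= 0xFFFF:  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (codepoint >> 12),
                      0b10000000 | ((codepoint >> 6) & 0b111111),
                      0b10000000 | (codepoint & 0b111111)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (codepoint >> 18),
                  0b10000000 | ((codepoint >> 12) & 0b111111),
                  0b10000000 | ((codepoint >> 6) & 0b111111),
                  0b10000000 | (codepoint & 0b111111)])

# One sample character from each row of the table.
for ch in ["A", "é", "€", "😀"]:
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")  # matches the built-in encoder
    print(f"U+{ord(ch):04X}", encoded.hex(" "))
# U+0041 41
# U+00E9 c3 a9
# U+20AC e2 82 ac
# U+1F600 f0 9f 98 80
```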
In this scheme:
- If a byte begins with bit 0, the character is a single byte. Otherwise the byte is part of a multi-byte character.
- If a byte begins with bits 11, it is the first byte of a multi-byte character, and the count of leading 1 bits is the number of bytes in the character.
- A byte is the start of a character if it does not begin with bits 10. Bytes that begin with 10 are continuation bytes.
- You can find the next or previous starting byte by searching for a byte that does not begin with bits 10.
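The properties in the list above can be sketched in Python with a few bit masks. The helper names `sequence_length`, `is_start_byte`, and `previous_start` are hypothetical, chosen for this illustration, and the sketch assumes the input is already valid UTF-8.

```python
def sequence_length(lead_byte: int) -> int:
    """Return how many bytes the character starting at lead_byte occupies."""
    if (lead_byte & 0b10000000) == 0:
        return 1  # 0xxxxxxx
    if (lead_byte & 0b11100000) == 0b11000000:
        return 2  # 110xxxxx
    if (lead_byte & 0b11110000) == 0b11100000:
        return 3  # 1110xxxx
    if (lead_byte & 0b11111000) == 0b11110000:
        return 4  # 11110xxx
    raise ValueError("not a valid starting byte")

def is_start_byte(byte: int) -> bool:
    """True unless the byte begins with bits 10 (a continuation byte)."""
    return (byte & 0b11000000) != 0b10000000

def previous_start(data: bytes, index: int) -> int:
    """Walk backward from index to the starting byte of its character."""
    while index > 0 and not is_start_byte(data[index]):
        index -= 1
    return index

data = "a€😀".encode("utf-8")  # 1 + 3 + 4 bytes
starts = [i for i, b in enumerate(data) if is_start_byte(b)]
print(starts)                                      # [0, 1, 4]
print([sequence_length(data[i]) for i in starts])  # [1, 3, 4]
print(previous_start(data, 6))                     # 4
```

Because continuation bytes are always marked with 10, a program that lands in the middle of a character never needs to rescan from the start of the data. It only has to step back at most 3 bytes.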
Unicode and Encoding Explained
- Unicode: Character Set, Encoding, UTF-8, Codepoint
- Unicode: Codepoint
- Unicode: Character Name
- ASCII Characters
- Unicode: UTF-8 Encoding
- Unicode: UTF-16 Encoding
- Unicode: Surrogate Pair
- Unicode: Byte Order (Endianness)
- Unicode: BOM, Byte Order Mark
- Set Text Editor File Encoding
- Unicode Letter Character
- Unicode: Variation Selector