Unicode Basics: Character Set, Encoding, UTF-8, Codepoint

By Xah Lee.

What's a Character Set?

A character set is a fixed collection of symbols. For example, the English alphabet “A” to “Z” and “a” to “z” can be a character set, with a total of 52 symbols. (26 uppercase letters, 26 lower case.)

One of the simplest standardized character sets is “ASCII”. It dates from the 1960s, and was almost the only one used in the USA up to the 1990s. (ASCII = American Standard Code for Information Interchange.)

ASCII contains 128 characters: 95 printable ones, which are all the {letters, digits, punctuation} you see on a typical keyboard sold in the USA, plus 33 control characters such as tab and newline.

ASCII is designed for languages that use the English alphabet only.
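As a small illustration of the 128-character limit, here is a Python sketch. The helper name `is_ascii_char` is made up for this example; it simply checks whether a character's number falls in ASCII's range of 0 to 127.

```python
def is_ascii_char(ch: str) -> bool:
    """Return True if the single character ch is one of the 128 ASCII characters."""
    return ord(ch) < 128

# The printable ASCII characters occupy the numbers 32 (space) through 126 (~).
ascii_printable = [chr(n) for n in range(32, 127)]

print(len(ascii_printable))   # 95
print(is_ascii_char("A"))     # True
print(is_ascii_char("α"))     # False, Greek alpha is outside ASCII
```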

Here's the complete list of ASCII characters: ASCII Table

What's Character Encoding?

Any text has to go through encoding/decoding in order to be properly stored as a file or displayed on screen. Your computer needs a way to translate the characters of your language's writing system into a sequence of 1s and 0s. This transformation is called character encoding.
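To make the “sequence of 1s and 0s” concrete, here is a minimal Python sketch. It encodes a two-letter string using ASCII (chosen here only because it is the simplest case), prints the resulting bits, then decodes them back.

```python
text = "Hi"

# Encoding: characters -> bytes
data = text.encode("ascii")

# Show each byte as 8 binary digits
bits = " ".join(f"{byte:08b}" for byte in data)
print(bits)      # 01001000 01101001

# Decoding: bytes -> characters
decoded = data.decode("ascii")
print(decoded)   # Hi
```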

There are many encoding systems. The most popular ones used today include UTF-8, UTF-16, and GB 18030. (See “What's the Most Popular Encoding?” below.)

Character Set and Encoding System

Character Set and Encoding System are different concepts, but are often confused.

In the early days of computing, these two concepts were not clearly distinguished, and were just called a char set or encoding system.

An example of the charset and encoding confusion can be seen in the HTML standard, in this code:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8">

The syntax contains the word “charset”, but it's actually about encoding, not charset. [see HTML: Character Sets and Encoding]

An encoding standard defines a character set implicitly, because it needs to define which characters it is designed to handle.
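One way to see this implicit character set is to ask an encoding to handle a character it was not designed for. The following Python sketch uses a made-up helper, `can_encode`, that reports whether a given encoding is able to represent a piece of text.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text belongs to the encoding's character set."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(can_encode("abc", "ascii"))    # True
print(can_encode("α", "ascii"))      # False, ASCII has no Greek letters
print(can_encode("é", "latin-1"))    # True
print(can_encode("α", "latin-1"))    # False
print(can_encode("α", "utf-8"))      # True
```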

Unicode's Character Set and Encoding Systems

Unicode is a standard created by the Unicode Consortium in 1991.

Unicode primarily defines 2 things:

  1. A character set. (which includes the characters needed for all languages.)
  2. Several encoding systems. (most popular are UTF-8, UTF-16)

Unicode's Character Set

Unicode's character set aims to include the written symbols of ALL human languages. It includes the tens of thousands of Chinese characters, math symbols, as well as characters of dead languages, such as Egyptian hieroglyphs 𓂀. And also “emoji”. [see Unicode Emoji 😄]

For unicode gallery and search box, see Unicode Search 💋 ♥ 😄

Codepoint

Each character in Unicode is given a unique ID. This id is a number (integer), starting at 0, and is called the char's code point.

(You can think of a code point as a “character ID”. It's not called “character id”, because some “characters” are not really characters, such as space, return, tab, right-to-left marker, etc.)

A code point is written either in decimal or in hexadecimal.

Example: the Greek letter α has code point 945 in decimal, which is 3B1 in hexadecimal.

Standard Notation for Codepoint

The standard notation for a codepoint is “U+” followed by its codepoint in hexadecimal, padded with leading zeros to at least 4 digits. e.g.

U+03B1
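In Python, the built-in functions `ord` and `chr` convert between a character and its code point. The sketch below also shows a small helper, `u_plus` (a name invented for this example), that prints the “U+” notation. It pads to at least 4 hex digits, which is how the Unicode Standard usually writes it.

```python
def u_plus(ch: str) -> str:
    """Format the code point of a single character in U+ notation."""
    return f"U+{ord(ch):04X}"

print(ord("α"))        # 945
print(hex(ord("α")))   # 0x3b1
print(u_plus("α"))     # U+03B1
print(u_plus("A"))     # U+0041
print(u_plus("😄"))    # U+1F604
print(chr(0x3B1))      # α
```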

Character Name

A unique name is given to each unicode codepoint.

Examples: “A” (U+0041) is named LATIN CAPITAL LETTER A, and “α” (U+03B1) is named GREEK SMALL LETTER ALPHA.

Note: a unicode codepoint may have 1 or more old names, due to character name changes in the early days, in Unicode version 2 in 1996. However, a name still uniquely refers to 1 codepoint.
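Character names can be looked up programmatically. Here is a short sketch using Python's standard `unicodedata` module, going from character to name and back. Which names are available depends on the Unicode version bundled with your Python.

```python
import unicodedata

# Character -> name
print(unicodedata.name("α"))   # GREEK SMALL LETTER ALPHA
print(unicodedata.name("A"))   # LATIN CAPITAL LETTER A

# Name -> character
alpha = unicodedata.lookup("GREEK SMALL LETTER ALPHA")
print(alpha)                   # α
```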

Unicode's Encoding Systems: UTF-8, UTF-16, etc.

Unicode also defines several encoding systems. UTF-8 and UTF-16 are the two most popular Unicode encoding systems. Each encoding system has advantages and disadvantages.

UTF-8 encodes each character into 1 to 4 bytes, with ASCII characters taking just 1 byte. This makes it suitable for text that is mostly English letters. For example, English, Spanish, French, and most web technology such as HTML, CSS, JavaScript.

Most files on Linux are in UTF-8 by default. The UTF-8 encoding system is backwards compatible with ASCII. (Meaning: if a file contains only ASCII characters, then encoding the file using UTF-8 results in the same byte sequence as encoding it using ASCII.)
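The ASCII compatibility can be checked directly. In this Python sketch, an ASCII-only string gives identical bytes under both encodings, while a non-ASCII character such as “é” takes more than one byte in UTF-8.

```python
text = "Hello, world!"

utf8_bytes = text.encode("utf-8")
ascii_bytes = text.encode("ascii")

print(utf8_bytes == ascii_bytes)   # True
print(utf8_bytes)                  # b'Hello, world!'

# A character outside ASCII needs a multi-byte sequence in UTF-8
print("é".encode("utf-8"))         # b'\xc3\xa9'
```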

UTF-16 is another encoding system from Unicode. With UTF-16, every character is encoded into either 2 or 4 bytes, and the commonly used characters in Unicode take exactly 2 bytes. For Asian languages containing lots of Chinese characters, such as Chinese and Japanese, UTF-16 creates smaller files than UTF-8, which needs 3 bytes for most of those characters.

There's also UTF-32, which always uses 4 bytes per character. It creates larger files, but is simpler to parse. Currently, UTF-32 is not used much.
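To compare the three encodings, here is a rough Python sketch that counts the bytes each one needs for a few sample characters. The helper name `encoded_sizes` is made up for this example. It uses the little-endian codec names (`utf-16-le`, `utf-32-le`) so that Python does not prepend a byte order mark, which would inflate the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes text occupies in UTF-8, UTF-16 and UTF-32."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")  # "-le" variants: no byte order mark
    return {enc: len(text.encode(enc)) for enc in encodings}

print(encoded_sizes("A"))    # {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes("é"))    # {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes("中"))   # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes("😄"))   # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

The Chinese character row shows why UTF-16 gives smaller files for Chinese text, and the “A” row shows why UTF-8 wins for English.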

Decoding

When an editor opens a file, it needs to know the encoding system used, in order to decode the byte stream and map it to fonts to display the original characters properly. In general, the info about the encoding system used for a file is not bundled with the file.

Before the internet, this was not much of a problem, because most of the English-speaking world used ASCII, and non-English regions used encoding schemes particular to their regions.

With the internet, files in different languages started to be exchanged a lot. When opening a file, Windows applications may try to guess the encoding system used, by some heuristics. When a file is opened in an app that assumes the wrong encoding, the result is typically gibberish.

Usually, you can explicitly tell an app to use a particular encoding to open the file. (For example, in web browsers, usually there's a menu. In Firefox, under View, Character Encoding.) Similarly, when saving a file, there's usually an option for you to specify what encoding to use. For example, in Microsoft Notepad, when you save a file, there's an “Encoding” menu at the bottom of the Save dialog.
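The gibberish from a wrong guess is easy to reproduce. In this Python sketch, a string is encoded as UTF-8, then decoded once with a wrong assumption (Latin-1, picked here only as a typical wrong guess) and once with the right one.

```python
original = "café"

# Save as UTF-8: the "é" becomes two bytes, C3 A9
data = original.encode("utf-8")

# Wrong guess: each of the two bytes is read as a separate Latin-1 character
wrong = data.decode("latin-1")
print(wrong)    # cafÃ©

# Right encoding: the original text comes back
right = data.decode("utf-8")
print(right)    # café
```

When reading files in Python, the same explicit choice is made with the `encoding` parameter of the built-in `open`, e.g. `open(path, encoding="utf-8")`, where `path` stands for your own file.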

Font

When a computer has decoded a file, it then needs to display the characters as glyphs on the screen. For our purposes, this set of glyphs is a font. So, your computer now needs to map the Unicode code points to glyphs in a font.

For Asian languages, such as Chinese, Japanese, Korean, or languages that use the Arabic alphabet as their writing system (Arabic, Persian), you also need the proper font to display the file correctly.

Input Method

For languages that are not based on an alphabet, such as Chinese, you need a way to “type” them. Such a way is called an “input system” or “input method”.


What's the Most Popular Encoding?

UTF-8 is used for 92% of web pages in the world.

Unicode Popularity: How Popular is UTF-8?

UTF-16 is used in the Java programming language, in Mac's HFS Plus file system, and in Microsoft Windows' NTFS file system.

GB 18030 is used throughout China. (UTF-8 is also used in some Chinese websites.)

See also: What Character Encoding Do Chinese Sites Use?

For more detail, see: [General questions, relating to UTF or Encoding Form By Unicode Consortium. At http://www.unicode.org/faq/utf_bom.html , accessed on 2014-04-09 ]

