Unicode: Character Set, Encoding, UTF-8, Codepoint

By Xah Lee. Date: 2010-06-20. Last updated: 2022-10-23.

What is a Character Set?

A character set (charset for short) is a fixed collection of symbols. For example, the English alphabet “A” to “Z” and “a” to “z” can be a character set, with a total of 52 symbols. (26 uppercase letters, 26 lowercase.)

One of the simplest standardized character sets is the American Standard Code for Information Interchange (aka ASCII). It dates from the 1960s, and was almost the only one used in the USA up to the 1990s.

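ASCII contains 128 characters. It includes all the {letters, digits, punctuation} you see on a typical keyboard sold in the USA (95 printable characters, counting the space), plus 33 non-printing control characters such as tab and newline.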

ASCII was designed for languages that use only the English alphabet.

ASCII does not contain some European language characters such as è ñ.
ASCII does not contain symbols such as { ™ © ♥ • † ∑ « » →}.
ASCII cannot be used for Chinese characters 中文, Arabic alphabet ش, Russian alphabet Ж, etc.

Here is the complete list of ASCII characters: ASCII Characters
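The following is a minimal Python sketch of the points above, not a definitive reference. It lists the 128 ASCII characters by their integer values, then checks a few sample characters against the set. The sample list is chosen only for illustration.

```python
# ASCII assigns the integers 0 to 127 to its 128 characters.
ascii_chars = [chr(i) for i in range(128)]

# Printable characters run from 0x20 (space) to 0x7E (~).
# The remaining 33 are control characters.
printable = [c for c in ascii_chars if 0x20 <= ord(c) <= 0x7E]

print(len(ascii_chars))  # 128
print(len(printable))    # 95
print("".join(printable))

# str.isascii() (Python 3.7 and later) is True only if every
# character of the string is in ASCII.
for ch in ["A", "z", "~", "è", "ñ", "™", "→", "中", "ش", "Ж"]:
    print(ch, ch.isascii())
```

Only the first three sample characters report True. The rest are outside ASCII.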

What is Character Encoding?

Character Encoding is the process of translating a character into a sequence of 1s and 0s (a Binary Number), according to a standard table called an Encoding System.

Any text file has to go through encoding and decoding in order to be properly stored as a file or displayed on screen. The computer needs a way to translate the characters of a human language's writing system into a sequence of 1s and 0s. This transformation is called Character Encoding.
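As a small illustration, here is a Python sketch that encodes two characters into bytes, prints the 1s and 0s, and decodes them back. The helper name to_bits is made up for this example, and UTF-8 is picked only because it is the most common encoding.

```python
def to_bits(data: bytes) -> str:
    """Render each byte as a group of 8 binary digits."""
    return " ".join(format(byte, "08b") for byte in data)

text = "Aè"
encoded = text.encode("utf-8")     # characters -> bytes (encoding)
print(encoded)                      # b'A\xc3\xa8'
print(to_bits(encoded))             # 01000001 11000011 10101000

decoded = encoded.decode("utf-8")  # bytes -> characters (decoding)
print(decoded == text)              # True
```

Note that “A” takes one byte while “è” takes two in this encoding.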

There are many encoding systems. The most popular encoding systems used today are:

UTF-8 (used by 98% of website files on Internet as of year 2022)
UTF-16
GB 18030 (Used in China, contains all Unicode chars). [see Chinese Encoding, Introduction]
EUC (Extended Unix Code). Used in Japan.
ASCII. For English. Most widely used before year 2000. Compatible with UTF-8.
ISO/IEC 8859 series (used for most European languages before year 2000)
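To see that these encoding systems really produce different bytes for the same character, here is a short Python sketch. The codec names are the ones Python's standard library uses for these encodings. The choice of sample characters is only for illustration.

```python
# Python codec names for some of the encodings listed above.
encodings = ["utf-8", "utf-16-be", "gb18030", "euc_jp"]

# The same character becomes different bytes under each encoding.
for name in encodings:
    data = "中".encode(name)
    print(f"{name:10} {data.hex(' ')}")

# Plain English text has identical bytes in ASCII and in UTF-8.
print("Hello".encode("ascii") == "Hello".encode("utf-8"))  # True

# ISO/IEC 8859-1 stores è in one byte, UTF-8 needs two.
print("è".encode("iso-8859-1").hex())  # e8
print("è".encode("utf-8").hex())       # c3a8
```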

What Does Character Encoding Standard Need to Define?

A Character Encoding standard essentially just needs to give each character a unique integer ID. This number is called the Codepoint. The codepoint is then represented in the computer as a Binary Number, normally in units of 8 bits, and thus as a sequence of Bytes.
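The two steps (character to codepoint, codepoint to bytes) can be sketched in Python as follows. The function utf8_encode is a hand-written illustration of the UTF-8 rules, named here only for this example. In practice you would call the built-in str.encode instead.

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one Unicode codepoint as UTF-8 bytes."""
    if not 0 <= codepoint <= 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("not an encodable Unicode codepoint")
    if codepoint < 0x80:         # 1 byte:  0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])

for ch in "Aè中😄":
    cp = ord(ch)  # step 1: character -> integer ID (codepoint)
    data = utf8_encode(cp)  # step 2: codepoint -> bytes
    print(ch, cp, hex(cp), data.hex(" "), data == ch.encode("utf-8"))
```

The last column compares the hand-written result with Python's built-in encoder.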

Character Set and Encoding System

Character Set (aka charset) and Encoding System are different concepts, but they are often confused.

A Character Set is just a set of characters.
An encoding system is a way to transform a sequence of characters (of a given charset) into a sequence of Bytes.

In the early days of computing, these two concepts were not clearly distinguished, and both were often just called a charset.

An example of the charset and encoding confusion can be seen in the HTML standard, in this code:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8">

The syntax contains the word charset, but it's actually about encoding, not charset. [see HTML: Charset and Encoding]

An encoding standard defines a character set implicitly, because it needs to define what characters it is designed to handle.
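One way to see the implicit character set of an encoding is to ask which characters it can encode at all. Here is a minimal Python sketch. The helper in_charset is a made-up name for this example, and the four encodings are picked only for illustration.

```python
def in_charset(ch: str, encoding: str) -> bool:
    """Return True if the encoding is able to represent the character."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

encodings = ["ascii", "iso-8859-1", "iso-8859-5", "utf-8"]

for ch in ["A", "è", "Ж", "中"]:
    row = {enc: in_charset(ch, enc) for enc in encodings}
    print(ch, row)
```

Each encoding accepts a different set of characters. Only UTF-8, whose implied character set is Unicode's, accepts all four samples.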

What is Unicode

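Unicode is a universal coded character set. (Universal Coded Character Set, UCS, is also the title of ISO/IEC 10646, the international standard that is kept in sync with Unicode.)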

Unicode is a standard created by the Unicode Consortium in 1991.

Unicode primarily defines 2 things:

A character set. (which includes the characters needed for all languages.)
Several encoding systems. (The most popular are UTF-8 and UTF-16.)
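The difference between one character set and several encoding systems can be sketched in Python. Each character below has a single codepoint, but its bytes depend on which Unicode encoding is used. UTF-32 is included here only for comparison, and the big-endian variants are chosen so that no byte order mark is added.

```python
forms = ["utf-8", "utf-16-be", "utf-32-be"]

for ch in ["A", "中", "😄"]:
    print(f"{ch}  U+{ord(ch):04X}")
    for form in forms:
        data = ch.encode(form)
        print(f"    {form:10} {len(data)} bytes  {data.hex(' ')}")
```

UTF-8 uses 1 to 4 bytes per character, UTF-16 uses 2 or 4, and UTF-32 always uses 4.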

Unicode's Character Set

Unicode's character set aims to include the written symbols of ALL human languages. It includes the thousands of Chinese characters 中文, Math Symbols ∑ ∫ π² ∞, and characters of dead languages, such as Egyptian Hieroglyph 𓂀 and Rune ᚠ. It also includes Emoji 😄 .
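As a quick check, the Python sketch below looks up the codepoint and official name of some of the characters mentioned above, using the standard unicodedata module. Which names are available depends on the Unicode version bundled with your Python, so a fallback string (made up for this example) is passed in.

```python
import unicodedata

for ch in "中∑∞𓂀ᚠ😄":
    # The second argument is returned if this Python has no name for ch.
    name = unicodedata.name(ch, "(no name in this Python's Unicode data)")
    print(f"U+{ord(ch):04X}  {ch}  {name}")
```

Characters such as 𓂀 and 😄 have codepoints above U+FFFF, which is why they need 4 bytes in both UTF-8 and UTF-16.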

Codepoint (Character ID)

What is Codepoint

Character Name

Unicode Character Name

Decoding

When an editor opens a file, it needs to know the encoding system used, in order to decode the sequence of 1s and 0s back into characters and then display them properly with a font. In general, the info about the encoding system used for a file is not bundled with the file.

Before the internet, this was not much of a problem, because most of the English-speaking world used ASCII, and non-English regions used encoding schemes particular to their regions.

With the internet, files in different languages started to be exchanged a lot. When opening a file, Windows applications may try to guess the encoding system used, by some heuristics. When a file is opened in an app that assumes the wrong encoding, the result is typically gibberish. Usually, you can explicitly tell an app to use a particular encoding to open the file. For example, see Emacs File Encoding FAQ. Similarly, when saving a file, there's usually an option for you to specify what encoding to use.
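The gibberish effect is easy to reproduce. The Python sketch below decodes the same bytes with a wrong and a right encoding, then shows a very naive guessing heuristic. The function guess_decode and its candidate list are hypothetical, written only for this example. Real applications use more elaborate detection.

```python
original = "è ñ"
data = original.encode("utf-8")      # the bytes as stored in the file

wrong = data.decode("iso-8859-1")    # reader assumes the wrong encoding
right = data.decode("utf-8")         # reader is told the correct encoding
print(wrong)                         # Ã¨ Ã±
print(right)                         # è ñ

def guess_decode(data: bytes, candidates=("utf-8", "gb18030", "iso-8859-1")):
    """Naive heuristic: return the first candidate that decodes without error.

    iso-8859-1 accepts any byte sequence, so it acts as a last resort
    and may still produce gibberish.
    """
    for enc in candidates:
        try:
            return enc, data.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits")

print(guess_decode(data))
print(guess_decode("中文".encode("gb18030")))
```

When reading a real file in Python, you can state the encoding explicitly, for example open(path, encoding="utf-8"), instead of relying on a guess.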

[see Set Text Editor File Encoding]

Font

When a computer has decoded a file, it then needs to display the characters as glyphs on the screen. For our purposes, this set of glyphs is a font. So, your computer now needs to map the Unicode codepoints to glyphs in a font.

For Asian languages such as Chinese, Japanese, and Korean, or languages that use the Arabic alphabet as their writing system (Arabic, Persian), you also need the proper font to display the file correctly.

[see Best Unicode Fonts for Programer]

Input Method

For languages whose writing is not based on an alphabet, such as Chinese, you need a way to “type” the characters. Such a way is called an input system or input method.

See: