Unicode: Character Set, Encoding, UTF-8, Codepoint

By Xah Lee.

What is a Character Set?

A character set (charset for short) is a fixed collection of symbols. For example, the English alphabet “A” to “Z” and “a” to “z” can be a character set, with a total of 52 symbols (26 uppercase letters and 26 lowercase letters).

One of the simplest standardized character sets is the American Standard Code for Information Interchange (aka ASCII). It dates from the 1960s, and was almost the only one used in the USA up to the 1990s.

ASCII contains 128 characters. It includes all the {letters, digits, punctuation} you see on a typical keyboard sold in the USA, plus a set of control characters such as tab and newline.

ASCII was designed for languages that use only the English alphabet.

Here is the complete list of ASCII characters: ASCII Characters
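As a small illustration of the 128-character limit, here is a Python sketch. The variable names are made up for this example. It lists the ASCII characters by their code values 0 through 127 and checks whether a given character belongs to the set.

```python
# All 128 ASCII characters correspond to the code values 0 through 127.
ascii_chars = [chr(code) for code in range(128)]

# Values 32..126 are the printable ones (space, letters, digits, punctuation).
# The remaining 33 are control characters such as tab and newline.
printable = [c for c in ascii_chars if 32 <= ord(c) <= 126]

print(len(ascii_chars))      # 128
print(len(printable))        # 95
print("Z".isascii())         # True
print("\u00e9".isascii())    # False: e with acute accent is not in ASCII
```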

What is Character Encoding?

Character Encoding is the process of translating a character into a sequence of 1s and 0s (a Binary Number), according to a standard table called an Encoding System.

Any text file has to go thru encoding/decoding in order to be properly stored as a file or displayed on screen. A computer needs a way to translate the character set of a human language's writing system into a sequence of 1s and 0s. This transformation is called Character Encoding.
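To make the "sequence of 1s and 0s" concrete, here is a minimal Python sketch. The helper name `to_bits` is hypothetical, chosen for this example. It encodes a string with a given encoding and prints each resulting byte as 8 binary digits.

```python
def to_bits(text, encoding="utf-8"):
    """Encode text and show each resulting byte as 8 binary digits."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))


print(to_bits("A"))            # 01000001
print(to_bits("Hi", "ascii"))  # 01001000 01101001
```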

There are many encoding systems. The most popular encoding systems used today are:

What Does Character Encoding Standard Need to Define?

A Character Encoding standard essentially just needs to give each character a unique integer ID. This number is called the Codepoint. The codepoint is then represented in the computer as a Binary Number, whose bits are normally grouped 8 to a unit, and is thus stored as a sequence of Bytes.
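The two steps described above (character to integer ID, then integer ID to bytes) can be seen separately in Python. This is only a sketch, using the euro sign and UTF-8 as an arbitrary example.

```python
ch = "\u20ac"                    # the euro sign
codepoint = ord(ch)              # step 1: the character's integer ID
utf8_bytes = ch.encode("utf-8")  # step 2: that ID stored as bytes

print(f"U+{codepoint:04X}")      # U+20AC
print(codepoint)                 # 8364
print(utf8_bytes.hex(" "))       # e2 82 ac
print(chr(codepoint) == ch)      # True: the ID maps back to the character
```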

Character Set and Encoding System

Character Set (aka charset) and Encoding System are different concepts, but they are often confused.

In the early days of computing, these two concepts were not clearly distinguished, and both were often just called a charset.

An example of the charset and encoding confusion can be seen in the HTML standard, in this code:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8">

The syntax contains the word charset, but it's actually about encoding, not charset. [see HTML: Charset and Encoding]

An encoding standard defines a character set implicitly, because it needs to define what characters it is designed to handle.
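One way to see the implicit character set of an encoding is to ask whether a character can be encoded at all. The Python sketch below uses a hypothetical helper, `can_encode`, which relies on the fact that Python's codecs raise `UnicodeEncodeError` for a character the encoding does not cover.

```python
def can_encode(ch, encoding):
    """Return True if the encoding's implicit character set contains ch."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


zhong = "\u4e2d"     # a Chinese character
e_acute = "\u00e9"   # e with acute accent

print(can_encode("A", "ascii"))        # True
print(can_encode(e_acute, "ascii"))    # False
print(can_encode(e_acute, "latin-1"))  # True
print(can_encode(zhong, "latin-1"))    # False
print(can_encode(zhong, "utf-8"))      # True
```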

What is Unicode

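The name Unicode suggests a universal character code: one coded character set for all writing systems. (Its character repertoire is kept in sync with ISO/IEC 10646, the Universal Coded Character Set.)

Unicode is a standard maintained by the Unicode Consortium. Version 1.0 was published in 1991.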

Unicode primarily defines 2 things:

  1. A character set. (which includes the characters needed for all languages.)
  2. Several encoding systems. (the most popular are UTF-8 and UTF-16.)
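The two parts are independent: the same character from the one character set comes out as different bytes under UTF-8 and UTF-16. Here is a short Python sketch; `encoded_sizes` is a name invented for this example, and `utf-16-be` is used so that Python does not prepend a byte order mark.

```python
def encoded_sizes(ch):
    """Return (bytes needed in UTF-8, bytes needed in UTF-16) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


print(encoded_sizes("A"))           # (1, 2)  Latin letter
print(encoded_sizes("\u4e2d"))      # (3, 2)  Chinese character
print(encoded_sizes("\U0001F604"))  # (4, 4)  emoji
```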

Unicode's Character Set

Unicode's character set aims to include the written symbols of ALL human languages. It includes the thousands of Chinese characters 中文, Math Symbols ∑ ∫ π² ∞, and characters of dead languages, such as Egyptian Hieroglyph 𓂀, Rune ᚠ. And, also Emoji 😄.
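Each of these characters has its own codepoint and an official name. As a rough sketch, Python's standard `unicodedata` module can look both up. The helper `describe` is hypothetical, and which names are available depends on the Unicode version shipped with your Python.

```python
import unicodedata


def describe(ch):
    """Return the codepoint in U+XXXX form, followed by the character's name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<no name>')}"


# A Chinese character, two math symbols, a rune, a hieroglyph, an emoji,
# written as escapes so the source stays plain ASCII.
for ch in "\u4e2d\u2211\u221e\u16a0\U00013080\U0001F604":
    print(describe(ch))
```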

Codepoint (Character ID)

Character Name

Decoding

When an editor opens a file, it needs to know the encoding system used, in order to decode the sequence of 1s and 0s into characters, which are then drawn with a font. In general, the info about the encoding system used for a file is not bundled with the file.
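Here is a small Python sketch of why the encoding must be known. The same stored bytes decode to different text depending on which encoding the reader assumes. The word and the two encodings are arbitrary choices for illustration.

```python
original = "caf\u00e9"           # "cafe" with an acute accent on the e
raw = original.encode("utf-8")   # the bytes as they would sit in the file

right = raw.decode("utf-8")      # the reader assumes the correct encoding
wrong = raw.decode("latin-1")    # the reader assumes a different encoding

print(raw)                       # b'caf\xc3\xa9'
print(right == original)         # True
print(wrong == original)         # False
print(len(right), len(wrong))    # 4 5: one character turned into two
```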

Before the internet, this was not much of a problem, because most of the English-speaking world used ASCII, and non-English regions used encoding schemes particular to their regions.

With the internet, files in different languages started to be exchanged a lot. When opening a file, Windows applications may try to guess the encoding system used, by some heuristics. When a file is opened in an app that assumes the wrong encoding, the result is typically gibberish. Usually, you can explicitly tell an app to use a particular encoding to open the file. For example, see Emacs File Encoding FAQ. Similarly, when saving a file, there's usually an option for you to specify what encoding to use.
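As a rough idea of what such guessing can look like, here is a deliberately naive Python sketch. It is not how any particular application does it. The function name `guess_decode` and the candidate list are assumptions made for this example. It tries each candidate encoding in order and keeps the first one that decodes without error.

```python
def guess_decode(raw, candidates=("utf-8", "latin-1")):
    """Try candidate encodings in order; return (text, encoding) for the first fit."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding, try the next
    raise ValueError("none of the candidate encodings fit")


text = "caf\u00e9"
print(guess_decode(text.encode("utf-8"))[1])    # utf-8
print(guess_decode(text.encode("latin-1"))[1])  # latin-1
```

Note that latin-1 accepts every possible byte, so a fallback like this never fails, but it can still guess wrong. That is one reason explicitly specifying the encoding is more reliable than guessing.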

[see Set Text Editor File Encoding]

Font

When a computer has decoded a file, it then needs to display the characters as glyphs on the screen. For our purposes, this set of glyphs is a font. So, your computer now needs to map the Unicode codepoints to glyphs in a font.

For Asian languages, such as Chinese, Japanese, Korean, or languages that use the Arabic alphabet as their writing system (Arabic, Persian), you also need the proper font to display the file correctly.

[see Best Unicode Fonts for Programer]

Input Method

For languages that are not based on an alphabet, such as Chinese, you need a way to “type” them. Such a way is called an “input system” or “input method”.

See:

Practical Examples

see also

Unicode and Encoding Explained