HTML: Character Sets and Encoding

By Xah Lee.

In HTML, you can declare the character set for the file. Here's an example of setting it to UTF-8 (Unicode):

<meta http-equiv="Content-Type" content="text/html;charset=utf-8">

If you are using HTML5, you can just say:

<meta charset="utf-8" />

[see Unicode Basics: Character Set, Encoding, UTF-8]

Once you have declared the character set, you can use characters from that character set directly in your HTML file.

Unicode (of which UTF-8 is an encoding) covers the characters of practically all the world's languages. Here is a sample of characters from Unicode:

© é 😂

[see Unicode Characters ∑ ♥ 😄]
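To make this concrete, here's a small Python sketch showing what the three sample characters above turn into when a file is saved as UTF-8. The helper name `utf8_hex` is made up for this illustration. It is not part of HTML or of any library.

```python
# The sample characters from above, written as escapes so this
# script does not depend on how it is itself saved.
samples = ["\u00a9", "\u00e9", "\U0001F602"]  # © é 😂

def utf8_hex(ch):
    """Return the UTF-8 bytes of ch as space-separated hex (hypothetical helper)."""
    return ch.encode("utf-8").hex(" ")

for ch in samples:
    # Print the Unicode code point and the bytes it becomes in UTF-8.
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")
```

Note that one character can take 2, 3, or 4 bytes in UTF-8. That is why the browser needs to know the encoding before it can read the file correctly.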

Character Entity

Another way to include special characters in your file is to use a so-called “character entity”.

Decimal Form

For example, the bullet symbol is Unicode character number 8226. In HTML, you can write it as &#8226;.

Hexadecimal Form

The number 8226 in hexadecimal is 2022. Sometimes you only know the hexadecimal number of a character. You can write it in hexadecimal like this: &#x2022;.
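If you want to check the two numeric forms yourself, here's a minimal Python sketch. It builds the decimal and hexadecimal forms for the bullet character and then decodes them with the standard library's `html.unescape`, which handles character references much as an HTML parser does. The variable names are just for illustration.

```python
import html

bullet = "\u2022"  # the bullet character

code_point = ord(bullet)          # the character's Unicode number, 8226
decimal_ref = f"&#{code_point};"  # decimal form
hex_ref = f"&#x{code_point:X};"   # hexadecimal form

print(decimal_ref, hex_ref)

# Both forms decode back to the same single character.
assert html.unescape(decimal_ref) == bullet
assert html.unescape(hex_ref) == bullet
```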

Named Form

For some commonly used characters, HTML provides a “named entity”. For example, the bullet character can be written as &bull;.

For a complete list of named entities, see: HTML/XML Entity List.
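As a quick check, here's a short Python sketch that looks up a named entity. It uses `name2codepoint` from the standard `html.entities` module, which maps entity names to Unicode numbers. It is only a way to poke at the table, not how a browser does it.

```python
import html
from html.entities import name2codepoint

# Look up the Unicode number behind the name "bull".
bull_number = name2codepoint["bull"]
print(bull_number)

# The named form decodes to the same character as the numeric forms.
assert html.unescape("&bull;") == chr(bull_number)
```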

HTML/HTTP's Charset is About Encoding, Not Character Set

HTTP's definition of charset (and the charset meta tag in HTML) is actually about character encoding.

Here's an excerpt from [RFC 2616]:

3.4 Character Sets

HTTP uses the same definition of the term “character set” as that described for MIME:

The term “character set” is used in this document to refer to a method used with one or more tables to convert a sequence of octets into a sequence of characters. Note that unconditional conversion in the other direction is not required, in that not all characters may be available in a given character set and a character set may provide more than one sequence of octets to represent a particular character. This definition is intended to allow various kinds of character encoding, from simple single-table mappings such as US-ASCII to complex table switching methods such as those that use ISO-2022's techniques. However, the definition associated with a MIME character set name MUST fully specify the mapping to be performed from octets to characters. In particular, use of external profiling information to determine the exact mapping is not permitted.

Note: This use of the term “character set” is more commonly referred to as a “character encoding.” However, since HTTP and MIME share the same registry, it is important that the terminology also be shared.

If you don't understand the difference between a character set and an encoding, see: Unicode Basics: Character Set, Encoding, UTF-8.
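Here's a small Python sketch of why the distinction matters. The same character, é, becomes different bytes under two different “charset” values, and bytes read with the wrong charset come out as the wrong characters. The variable names are made up for this example.

```python
text = "\u00e9"  # é, Unicode number 233 (hex E9)

# Same character, two encodings, different bytes.
utf8_bytes = text.encode("utf-8")         # 2 bytes
latin1_bytes = text.encode("iso-8859-1")  # 1 byte
print(utf8_bytes, latin1_bytes)

# Reading UTF-8 bytes with the wrong charset gives wrong characters:
# the 2 bytes are taken as 2 separate characters, Ã and ©.
wrong = utf8_bytes.decode("iso-8859-1")
print(len(wrong))

# Reading with the declared charset gives back the original.
assert utf8_bytes.decode("utf-8") == text
```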

What's HTML4 or HTML5's Default Encoding?

By spec, there's no default encoding.

The encoding must come from either the HTTP header or a meta tag in the HTML file. If neither is found, the browser must guess.
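The lookup described above can be sketched roughly like this in Python. This is a simplified illustration, not the actual algorithm browsers use. Real browsers do more, such as checking for a byte order mark. The function name `pick_encoding`, the regular expressions, and the 1024-byte scan limit are all assumptions made for this sketch. It tries the HTTP header first, then a meta tag, and returns None when the browser would have to guess.

```python
import re

# Hypothetical patterns for this sketch; real parsers are stricter.
HEADER_RE = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)
META_RE = re.compile(rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)

def pick_encoding(content_type, html_bytes):
    """Return the declared encoding name, or None if nothing is declared."""
    # 1. The HTTP Content-Type header, e.g. "text/html; charset=utf-8".
    if content_type:
        m = HEADER_RE.search(content_type)
        if m:
            return m.group(1).lower()
    # 2. A meta tag near the start of the file (either form shown earlier).
    m = META_RE.search(html_bytes[:1024])
    if m:
        return m.group(1).decode("ascii").lower()
    # 3. Nothing declared: the browser must guess.
    return None

print(pick_encoding("text/html", b'<meta charset="utf-8" />'))
```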

