Unicode: BOM, Byte Order Mark
What is Byte Order Mark
Byte Order Mark (BOM, U+FEFF) is a character in Unicode.
- Unicode Character Name: ZERO WIDTH NO-BREAK SPACE
- old-name: BYTE ORDER MARK
- Codepoint: 65279
- Codepoint in hexadecimal: FEFF
- UTF-8 Encoding: EF BB BF
- UTF-16 Encoding: FE FF (big-endian) or FF FE (little-endian)
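The values in this list can be checked with a short Python sketch. It uses only the standard codecs; the variable names are illustrative. The explicit "-be" and "-le" codecs are used because the plain "utf-16" codec prepends a BOM of its own.

```python
# Encode U+FEFF with the standard codecs and look at the raw bytes.
BOM = "\ufeff"

codepoint = ord(BOM)                       # decimal codepoint
utf8_bytes = BOM.encode("utf-8")           # UTF-8 form
utf16_be_bytes = BOM.encode("utf-16-be")   # UTF-16, big-endian
utf16_le_bytes = BOM.encode("utf-16-le")   # UTF-16, little-endian

print(codepoint)              # 65279
print(utf8_bytes.hex())       # efbbbf
print(utf16_be_bytes.hex())   # feff
print(utf16_le_bytes.hex())   # fffe
```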
Purpose of Byte Order Mark
The BOM character is an invisible character, designed to be placed, optionally, at the beginning of a file, for 2 purposes:
- To indicate that this file is encoded in Unicode. (known as Unicode Signature)
- To indicate the byte-order of the file. [see Unicode: Byte Order (Endianness)]
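Both purposes can be sketched as a small sniffing function: look at the first bytes of the data and, if they match a known BOM, report the encoding and byte order. This is a minimal sketch, not a full encoding detector. The name `sniff_bom` and the table `_BOMS` are made up for illustration; the `codecs.BOM_*` constants come from the Python standard library.

```python
import codecs

# Longer signatures are checked first, because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0

sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = sniff_bom(sample)
print(encoding, skip)                   # utf-16-le 2
print(sample[skip:].decode(encoding))   # hi
print(sniff_bom(b"plain ascii"))        # (None, 0)
```

Note that this is a heuristic: data that starts with FF FE 00 00 could in principle also be little-endian UTF-16 text whose first character is U+0000, and the sketch reports it as UTF-32.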
BOM and UTF-8
- A BOM is not needed for files encoded in UTF-8. The smallest unit of UTF-8 encoding is a byte, so UTF-8 has no byte-order issue.
- UTF-8 is easy to detect from its byte pattern, so a Unicode Signature is not needed either.
- When used in UTF-8, the BOM only gives an indication that the file is encoded in a Unicode encoding.
- Adding a BOM makes an otherwise ASCII-only file incompatible with ASCII, because the bytes EF BB BF are not ASCII.
- In unix-like operating systems, a BOM interferes with the unix shebang line hack, because the file no longer begins with the bytes “#!”.
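The last two points, and one common way to deal with a UTF-8 BOM when reading, can be illustrated with a short Python sketch. The sample script content and the helper `is_ascii` are made up for illustration; "utf-8-sig" is a standard Python codec that drops a leading BOM if one is present.

```python
import codecs

def is_ascii(data: bytes) -> bool:
    """True if every byte in data is a valid ASCII byte."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False

script = b"#!/bin/sh\necho hello\n"
with_bom = codecs.BOM_UTF8 + script

# Without a BOM the file is pure ASCII and begins with "#!".
print(is_ascii(script), script.startswith(b"#!"))       # True True
# With a BOM it is neither.
print(is_ascii(with_bom), with_bom.startswith(b"#!"))   # False False

# "utf-8" keeps the BOM as a leading U+FEFF character;
# "utf-8-sig" removes it when present.
kept = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")
print(kept.startswith("\ufeff"))            # True
print(stripped == script.decode("utf-8"))   # True
```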
ZERO WIDTH NO-BREAK SPACE vs WORD JOINER
The use of the BOM char as a zero-width no-break space is deprecated since Unicode 3.2 (published in 2002). That char is now meant to be used as BOM only. U+2060 (WORD JOINER) is now the char to use for a zero-width no-break space.
Other tips about BOM:
- The UTF-8 representation of the BOM is the byte sequence EF BB BF. A text editor or web browser that interprets the text as ISO-8859-1 or CP1252 displays it as “ï»¿”.
- Some software from Microsoft (e.g. Notepad, Visual C++) adds a BOM when a file is saved using UTF-8 encoding.
- Frequently Asked Questions: UTF-8, UTF-16, UTF-32 and BOM, by the Unicode Consortium, at http://www.unicode.org/faq/utf_bom.html
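The misinterpretation described in the first tip is easy to reproduce. The following Python sketch decodes the three UTF-8 BOM bytes with the wrong codecs; the variable names are illustrative.

```python
# The UTF-8 BOM bytes, decoded as if they were single-byte text.
bom_bytes = b"\xef\xbb\xbf"

as_cp1252 = bom_bytes.decode("cp1252")
as_latin1 = bom_bytes.decode("iso-8859-1")

print(as_cp1252)   # ï»¿
print(as_latin1)   # ï»¿
```

Both codecs map the bytes EF, BB, BF to the same three visible characters, which is why the stray text looks identical in either case.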