Unicode: BOM, Byte Order Mark
What is Byte Order Mark
Byte Order Mark (BOM, U+FEFF) is a character in Unicode.
- Unicode: Character Name: ZERO WIDTH NO-BREAK SPACE
- old-name: BYTE ORDER MARK
- Codepoint: 65279
- Codepoint in hexadecimal: FEFF
- UTF-8 Encoding: EF BB BF
- UTF-16 Encoding: FE FF (big-endian), FF FE (little-endian)
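The values in this list can be checked with a short Python sketch. It uses only the standard library; the variable names are illustrative and not part of any API.

```python
import unicodedata

bom = "\ufeff"

# Code point in decimal and in hexadecimal.
codepoint_dec = ord(bom)
codepoint_hex = format(ord(bom), "X")

# Current Unicode character name of U+FEFF.
name = unicodedata.name(bom)

# Encoded forms. The UTF-16 bytes depend on the byte order chosen.
utf8_bytes = bom.encode("utf-8")
utf16_be_bytes = bom.encode("utf-16-be")
utf16_le_bytes = bom.encode("utf-16-le")

print(codepoint_dec, codepoint_hex, name)
print("UTF-8:    ", utf8_bytes.hex(" ").upper())
print("UTF-16 BE:", utf16_be_bytes.hex(" ").upper())
print("UTF-16 LE:", utf16_le_bytes.hex(" ").upper())
```

The two UTF-16 results differ only in byte order, which is the property the BOM is used to signal.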
Purpose of Byte Order Mark
The BOM character is an invisible character, designed to be placed, optionally, at the beginning of a file, for 2 purposes:
- To indicate that this file is encoded in Unicode. (known as Unicode Signature)
- To indicate the byte-order of the file. 「see Unicode: Byte Order (Endianness)」
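A minimal sketch of how a program might use the BOM for both purposes, written in Python. The function name `sniff_bom` is hypothetical. It handles only UTF-8 and UTF-16; UTF-32 signatures are not handled, so a UTF-32 little-endian file would be misreported as UTF-16.

```python
import codecs

# Known signatures, paired with the encoding each one implies.
# Assumption: only UTF-8 and UTF-16 are of interest here.
_SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for signature, encoding in _SIGNATURES:
        if data.startswith(signature):
            return encoding, len(signature)
    return None, 0


sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = sniff_bom(sample)
print(encoding, sample[skip:].decode(encoding))
```

When no BOM is found, the caller has to fall back to some other way of deciding the encoding, since the BOM is optional.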
BOM and UTF-8
- BOM is not needed for files encoded with UTF-8. The smallest unit of UTF-8 encoding is a byte, so UTF-8 doesn't have the byte-order issue.
- UTF-8 is easy to detect from its byte pattern, so there is no need for a Unicode Signature either.
- When used in UTF-8, the BOM just gives an indication that the file is encoded with a Unicode encoding.
- Adding a BOM makes the file incompatible with ASCII. A UTF-8 file that contains only ASCII characters is otherwise byte-for-byte identical to an ASCII file.
- In unix-like operating systems, a BOM interferes with the unix shebang line hack, because the file must begin with the 2 bytes of #!.
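The ASCII and shebang problems can be seen in a small Python sketch. The helper name `strip_utf8_bom` and the sample script are made up for illustration.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A shell script saved as UTF-8 with a BOM in front.
script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hi\n"

# The file no longer starts with the 2 bytes "#!".
shebang_before = script.startswith(b"#!")
shebang_after = strip_utf8_bom(script).startswith(b"#!")

# The BOM bytes are outside the ASCII range, so strict ASCII decoding fails.
try:
    script.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Python's "utf-8-sig" codec drops a leading BOM when decoding.
text = script.decode("utf-8-sig")

print(shebang_before, shebang_after, ascii_ok, text.startswith("#!"))
```

Reading with "utf-8-sig" is harmless for files that have no BOM, so it is a reasonable default when the input may come from tools that add one.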
ZERO-WIDTH NO-BREAKING SPACE vs WORD JOINER
The BOM char's use as a zero-width no-break space is deprecated since Unicode 3.2 (published in 2002). That char's semantic is now for BOM only. “U+2060” (WORD JOINER) is now used as the zero-width no-break space.
Other notes on BOM:
- The UTF-8 representation of the BOM is the byte sequence EF BB BF. A text editor or web browser that interprets the text as ISO-8859-1 or CP1252 displays it like this: ï»¿
- Some software from Microsoft (e.g. Notepad, Visual C++) adds a BOM when a file is saved using UTF-8 encoding.
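Both notes can be reproduced in Python. This is only a sketch: the "utf-8-sig" codec is used here to imitate software that adds a BOM on save, which is an assumption about the effect, not a description of how Notepad or Visual C++ work internally.

```python
bom_bytes = "\ufeff".encode("utf-8")  # EF BB BF

# Decoding the 3 BOM bytes with a single-byte encoding gives 3 visible chars.
as_cp1252 = bom_bytes.decode("cp1252")
as_latin1 = bom_bytes.decode("iso-8859-1")

# Encoding with "utf-8-sig" puts a BOM in front of the text.
saved = "hello".encode("utf-8-sig")

print(as_cp1252, as_latin1)
print(saved.hex(" ").upper())
```

A file that starts with ï»¿ when viewed is therefore usually a UTF-8 file with a BOM, opened with the wrong encoding.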
Reference
- Frequently Asked Questions: UTF-8, UTF-16, UTF-32 and BOM By Unicode Consortium. At http://www.unicode.org/faq/utf_bom.html