Unicode BOM Byte Order Mark Hack
Facts about Unicode's Byte Order Mark (BOM) character:
- The BOM char's name is ZERO WIDTH NO-BREAK SPACE.
- The BOM char's codepoint is 65279 in decimal, U+FEFF in hexadecimal.
- The primary purpose of the BOM is to indicate byte order (big endian vs little endian) in systems or situations that need this info in the file.
- A BOM is not needed for files encoded with UTF-8. The smallest unit of UTF-8 encoding is a byte, so it doesn't have the byte-order issue.
- When used in UTF-8, it just gives an indication that the file is Unicode text, specifically UTF-8. (Each of UTF-8, UTF-16, UTF-32 encodes the BOM as a different byte sequence.)
- You should not add a BOM to UTF-8 encoded files.
- The UTF-8 representation of the BOM is the byte sequence 0xEF, 0xBB, 0xBF. A text editor or web browser that interprets the text as ISO-8859-1 or CP1252 will display it like this: ï»¿
- In unix-like OSes, a BOM in a UTF-8 file conflicts with the unix shebang line hack (that is, the #!/usr/bin/python in the first line), because the #! must be the very first two bytes of the file. [see Unix/Linux Shell Shebang: Who Gets to Use the First Char?]
- Some software from Microsoft (for example, Notepad, Visual C++) will add a BOM when a file is saved using UTF-8 encoding.
- The BOM char's use as a zero-width no-break space has been deprecated since Unicode 3.2 (published in 2002). That char's semantics is now BOM only. U+2060 (WORD JOINER) is now used for the zero-width no-break function.
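The facts above can be checked from Python. This is a minimal sketch using only the standard library. The helper name strip_utf8_bom is made up for this illustration, it is not from any library.

```python
import codecs

BOM = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, U+FEFF

code_point = ord(BOM)                 # 65279 in decimal
utf8_bytes = BOM.encode("utf-8")      # the 3 bytes EF BB BF
utf16_be = BOM.encode("utf-16-be")    # FE FF, big endian
utf16_le = BOM.encode("utf-16-le")    # FF FE, little endian

# What a viewer shows when it wrongly reads the UTF-8 BOM bytes as CP1252.
mojibake = utf8_bytes.decode("cp1252")


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# The "utf-8-sig" codec drops a leading BOM when decoding,
# while plain "utf-8" keeps it as the first character.
text_sig = b"\xef\xbb\xbfhi".decode("utf-8-sig")
text_raw = b"\xef\xbb\xbfhi".decode("utf-8")
```

Here text_sig is "hi", while text_raw still starts with the invisible U+FEFF. That leftover char is the usual source of trouble when a BOM-prefixed file is read as plain UTF-8.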
See also:
- Frequently Asked Questions: UTF-8, UTF-16, UTF-32 and BOM By Unicode Consortium. At http://www.unicode.org/faq/utf_bom.html
- Byte Order Mark, Wikipedia. At https://en.wikipedia.org/wiki/Byte_Order_Mark
BOM is a Hack
Note that using the BOM as the first char to indicate byte order is a hack. It's the same trick unix does with the shebang scheme (i.e. the first chars in the file are #! followed by a program path, as a way to embed info about where to find the program. So you can call commands without naming the interpreter first, for example, processFile.pl fileName instead of perl processFile.pl fileName).
In both cases, BOM and shebang, the first few bytes are used as an indicator of some particular meaning.
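Here is a rough sketch of what "first few bytes as indicator" looks like to a program. The function name sniff_first_bytes and the label strings are invented for this example. The BOM constants are from Python's codecs module.

```python
import codecs

# Check longer signatures first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be BOM"),
    (codecs.BOM_UTF32_LE, "utf-32-le BOM"),
    (codecs.BOM_UTF8, "utf-8 BOM"),
    (codecs.BOM_UTF16_BE, "utf-16-be BOM"),
    (codecs.BOM_UTF16_LE, "utf-16-le BOM"),
    (b"#!", "shebang"),
]


def sniff_first_bytes(data: bytes) -> str:
    """Name the marker at the start of data (hypothetical helper)."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "no marker"


script = b"#!/usr/bin/python\nprint('hi')\n"
script_with_bom = codecs.BOM_UTF8 + script

plain_result = sniff_first_bytes(script)           # "shebang"
bom_result = sniff_first_bytes(script_with_bom)    # "utf-8 BOM"
```

The second result shows the conflict: once a BOM is in front, the file no longer begins with #!, so only one of the two hacks can own the first bytes.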
Similarly, in {Emacs, Python, Ruby}, if the first line has the form -*- coding: utf-8 -*- , it indicates that the file is UTF-8 encoded. Again, it creates a problem. For example, if a python script uses the shebang but is also UTF-8 encoded, what to put in the first line? (Python's workaround is to also accept the coding line on the second line.)
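For Python, the standard library can show how the interpreter resolves this. A small sketch using tokenize.detect_encoding, which looks for a BOM and for a coding line in the first two lines (per PEP 263). The wrapper name source_encoding and the sample byte strings are made up for illustration.

```python
import io
import tokenize


def source_encoding(source: bytes) -> str:
    """Return the encoding Python's tokenizer detects for source bytes."""
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# Shebang on line 1, coding declaration pushed down to line 2.
shebang_then_coding = (
    b"#!/usr/bin/python\n"
    b"# -*- coding: latin-1 -*-\n"
    b"print('hi')\n"
)

# UTF-8 BOM at the start, no coding line.
bom_only = b"\xef\xbb\xbfprint('hi')\n"

# Neither marker present.
no_marker = b"print('hi')\n"

enc_cookie = source_encoding(shebang_then_coding)  # 'iso-8859-1'
enc_bom = source_encoding(bom_only)                # 'utf-8-sig'
enc_default = source_encoding(no_marker)           # 'utf-8'
```

Note that the tokenizer normalizes the name latin-1 to iso-8859-1, and reports a BOM-prefixed file as utf-8-sig.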
The real solution is a metadata file format. [see How to View Comments in JPEG, PNG, MP3 files?] But of course, hacks are created to solve practical problems at hand. The unix shebang was there before there was Unicode. And Unicode BOM usage came before the exploration of metadata file format standards. Even today, metadata file formats are not widely used or standardized. Different file systems used by different OSes may or may not have their own schemes for metadata, with varying degrees of popularity, but they are usually not portable across file systems. As of today, the vast majority of files do not contain info about what encoding they are in.
The word “Endianness” is from: Gulliver's Travels: PART 1, Chapter 4: A VOYAGE TO LILLIPUT.
see also Invisible Character from Twitter