Unicode: BOM, Byte Order Mark

By Xah Lee. Date: 2022-10-22. Last updated: 2022-10-24.

What is Byte Order Mark

Byte Order Mark (BOM, U+FEFF) is a character in Unicode.

Unicode: Character Name: ZERO WIDTH NO-BREAK SPACE
old-name: BYTE ORDER MARK
Codepoint: 65279
Codepoint in hexadecimal: FEFF
UTF-8 Encoding: EF BB BF
UTF-16 Encoding: FE FF (big-endian) or FF FE (little-endian)
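
The values above can be checked in Python. This is a small sketch using only the standard library; the variable names are made up for illustration. The explicit "utf-16-be" and "utf-16-le" codecs are used because the plain "utf-16" codec prepends its own BOM.

```python
import unicodedata

bom = "\ufeff"

name = unicodedata.name(bom)        # official Unicode character name
codepoint = ord(bom)                # codepoint as a decimal integer
utf8_bytes = bom.encode("utf-8")    # the 3-byte UTF-8 form
utf16_be = bom.encode("utf-16-be")  # big-endian byte order
utf16_le = bom.encode("utf-16-le")  # little-endian byte order

print(name)
print(codepoint, hex(codepoint))
print(utf8_bytes.hex(), utf16_be.hex(), utf16_le.hex())
```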

Purpose of Byte Order Mark

The BOM character is an invisible character, designed to be placed at the beginning of a file, optionally, for 2 purposes:

To indicate that this file is encoded in Unicode. (known as Unicode Signature)
To indicate the byte-order of the file. [see Unicode: Byte Order (Endianness)]
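
Both purposes can be sketched as a check on the first few bytes of a file. The function name detect_bom and the table _BOMS are hypothetical, for illustration only; the BOM constants come from Python's standard codecs module. One caveat of this approach: a UTF-16 little-endian file whose first character after the BOM is U+0000 would be taken for UTF-32.

```python
import codecs

# Longer signatures are tested first, because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is no BOM."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding
    return None

print(detect_bom(b"\xff\xfeh\x00i\x00"))  # BOM says UTF-16, little-endian
print(detect_bom(b"hi"))                  # no BOM, so nothing is indicated
```

A return value of None does not mean the file is not Unicode. It only means no signature is present.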

BOM and UTF-8

BOM is not needed for files encoded with UTF-8. The smallest unit of UTF-8 encoding is a byte, so it doesn't have the byte-order issue.
UTF-8 is easy to detect due to its byte pattern, so there is no need for a Unicode Signature either.
When used in UTF-8, it just gives an indication that the file is encoded with a Unicode encoding.
Adding BOM makes the file incompatible with ASCII.
In unix-like operating systems, BOM interferes with the unix shebang line hack.
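
The ASCII and shebang problems can be seen in a short Python sketch. The sample bytes and variable names are made up for illustration. The "utf-8-sig" codec is part of Python's standard library and drops a leading BOM when decoding.

```python
# A shell script saved as UTF-8 with a BOM in front (sample data).
data = b"\xef\xbb\xbf#!/bin/sh\necho hi\n"

# The file no longer starts with the bytes "#!".
starts_with_shebang = data.startswith(b"#!")

# The BOM bytes are outside the ASCII range, so ASCII decoding fails.
try:
    data.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

plain = data.decode("utf-8")         # BOM is kept as the character U+FEFF
stripped = data.decode("utf-8-sig")  # BOM is removed, if present

print(starts_with_shebang, ascii_ok)
print(repr(plain[0]), repr(stripped[0]))
```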

ZERO-WIDTH NO-BREAKING SPACE vs WORD JOINER

The BOM char's use as a zero-width no-breaking space has been deprecated since Unicode 3.2 (published in 2002). That char's meaning is now BOM only. U+2060 (WORD JOINER) is now used for zero-width non-breaking space.

Other notes on BOM:

The UTF-8 representation of the BOM is the byte sequence EF BB BF. A text editor or web browser that interprets the text as ISO-8859-1 or CP1252 displays it like this: ï»¿.
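
This can be reproduced in Python by decoding the 3 BOM bytes with the wrong encoding. A minimal sketch; the variable names are made up for illustration.

```python
import codecs

bom_bytes = codecs.BOM_UTF8  # the bytes EF BB BF

# Each byte is read as one character in these single-byte encodings.
as_cp1252 = bom_bytes.decode("cp1252")
as_latin1 = bom_bytes.decode("iso-8859-1")

print(as_cp1252, as_latin1)
print([hex(ord(c)) for c in as_cp1252])
```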

Some software from Microsoft (e.g. Notepad, Visual C++) adds BOM when a file is saved using UTF-8 encoding.
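
For comparison, Python writes such a file only when asked to. The sketch below shows the standard "utf-8-sig" codec adding the BOM on encoding, and one possible way to convert the result back to plain UTF-8. The sample text and variable names are made up for illustration.

```python
text = "hello"

with_bom = text.encode("utf-8-sig")  # BOM is prepended
without_bom = text.encode("utf-8")   # no BOM

# Convert BOM-prefixed bytes back to plain UTF-8.
cleaned = with_bom.decode("utf-8-sig").encode("utf-8")

print(with_bom)
print(without_bom)
print(cleaned == without_bom)
```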

Reference