Unicode BOM Hack

By Xah Lee.

What is Unicode BOM

BOM is a Hack

Note that using a BOM as the first character to indicate byte order is a hack. It's the same trick Unix uses with the shebang scheme: the first characters of a file are #!, followed by a program path, as a way to embed info about where to find the interpreter. This lets you call a script without naming the interpreter first, for example, processFile.pl fileName instead of perl processFile.pl fileName.

In both cases, BOM and shebang, the first few bytes of the file are used as an indicator with a special meaning.
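To make the "first few bytes" idea concrete, here is a minimal Python sketch that checks the start of a byte string for a Unicode BOM or a shebang. The function name sniff_prefix and the labels it returns are made up for illustration; the BOM constants come from Python's standard codecs module.

```python
import codecs

# Longer signatures go first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le BOM"),
    (codecs.BOM_UTF32_BE, "utf-32-be BOM"),
    (codecs.BOM_UTF8, "utf-8 BOM"),
    (codecs.BOM_UTF16_LE, "utf-16-le BOM"),
    (codecs.BOM_UTF16_BE, "utf-16-be BOM"),
    (b"#!", "shebang"),
]


def sniff_prefix(data: bytes) -> str:
    """Return a label for the first known signature found at the start of data."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "no known prefix"
```

For example, sniff_prefix(b"#!/usr/bin/perl\n") returns "shebang", and a byte string starting with EF BB BF is reported as a UTF-8 BOM. Note that the bytes FF FE 00 00 could also be a UTF-16 LE BOM followed by a NUL character, so even this check is only a guess. That ambiguity is part of why the scheme is a hack.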

Similarly, in {Emacs, Python, Ruby}, if the first line has the form -*- coding: utf-8 -*-, it indicates that the file is UTF-8 encoded. Again, this creates a problem. For example, if a Python script uses the shebang and also needs to declare UTF-8 encoding, what goes on the first line?
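Python's own answer to the conflict is to accept the coding declaration on either the first or the second line (PEP 263), so the shebang can stay on line 1. Below is a simplified sketch of that lookup. The function name declared_encoding is made up for illustration, the regular expression follows the pattern given in PEP 263, and the sketch ignores details the real tokenizer handles (such as BOM checks and validating the encoding name).

```python
import re

# Pattern from PEP 263: a comment containing "coding:" or "coding=" and a name.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source: bytes):
    """Return the encoding declared in the first two lines, or None."""
    for line in source.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


script = b"#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\nprint('hi')\n"
print(declared_encoding(script))
```

For real use, the standard library's tokenize.detect_encoding does this job properly. The point here is only that the "magic first line" has to grow into a "magic first two lines" rule to coexist with the shebang.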

The real solution is a metadata file format. [see How to View Comments in JPEG, PNG, MP3 files?] But of course, hacks are created to solve practical problems at hand. The Unix shebang was there before Unicode existed, and the use of the Unicode BOM came before the exploration of metadata file format standards. Even today, metadata file formats are not widely used or standardized. File systems used by different operating systems may or may not have their own metadata schemes, with varying degrees of popularity, but these are usually not portable across file systems. As of today, the vast majority of files do not contain info about what encoding they use.

The word “endianness” comes from Gulliver's Travels, Part 1 (A Voyage to Lilliput), Chapter 4.

see also Invisible Character from Twitter