Unicode BOM Byte Order Mark Hack

Some notes on Unicode's Byte Order Mark (BOM) character:

BOM is a Hack

Note that using the BOM as the first character to indicate byte order is a hack. It is the same trick Unix uses with the shebang scheme: the first characters in a file are #!, followed by a program path, to indicate that the text file is an executable program. That way you can call a command without naming the interpreter first, for example processFile.pl fileName instead of perl processFile.pl fileName.

In both cases, BOM and shebang, the first few bytes are used as an indicator of some particular meaning.
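As a rough illustration of that idea, here is a minimal Python sketch that inspects the leading bytes of a file and reports which marker, if any, they carry. The function name sniff_prefix and the returned labels are made up for this example; the BOM byte sequences come from Python's standard codecs module.

```python
import codecs

# Known byte order marks. The UTF-32 entries come first because the
# UTF-32 LE mark (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_prefix(data: bytes) -> str:
    """Describe what the first few bytes claim (hypothetical helper)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return "BOM: " + name
    if data.startswith(b"#!"):
        # Everything after "#!" up to the end of the first line.
        first_line = data[2:].split(b"\n", 1)[0]
        return "shebang: " + first_line.decode("ascii", "replace").strip()
    return "no marker"
```

Note the ambiguity this kind of sniffing cannot escape: a file that is really UTF-16 LE text starting with a NUL character looks identical to one with a UTF-32 LE mark, which is part of why the scheme is a hack.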

Similarly, in {Emacs, Python, Ruby}, if the first line has the form -*- coding: utf-8 -*-, it indicates that the file is UTF-8 encoded. Again, this creates a problem. For example, if a Python script uses the shebang but is also UTF-8 encoded, what goes on the first line?
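For Python at least, the conflict is resolved by convention: PEP 263 allows the declaration on either the first or the second line, so the shebang can come first. The sketch below shows how such a declaration might be found. The function name declared_encoding is made up for this example, and the pattern follows the one given in PEP 263; the standard library's tokenize.detect_encoding does the real job, including BOM handling.

```python
import re

# Pattern for an encoding declaration, following PEP 263.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source: bytes):
    """Return the encoding named on line 1 or 2, or None (hypothetical helper)."""
    for line in source.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None
```

A script that begins with #!/usr/bin/env python and carries # -*- coding: utf-8 -*- on its second line is reported as utf-8, while a declaration on the third line or later is ignored.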

The real solution is a metadata file format. 〔➤ How to View Comments in JPEG, PNG, MP3 files?〕 But of course, hacks are created to solve practical problems at hand. The Unix shebang existed before Unicode did, and the use of the Unicode BOM came before the exploration of metadata file format standards. Even today, metadata file formats are not widely used or standardized. The file systems used by different operating systems may or may not have their own metadata schemes, with varying degrees of popularity, but these are usually not portable across file systems. As of today, the vast majority of files carry no information about their encoding.

The word “endianness” comes from Gulliver's Travels, Part 1, Chapter 4: A Voyage to Lilliput.

See also: Annoying Invisible ZERO WIDTH NO-BREAK SPACE Character from Google Plus, Twitter
