Linux Hacker Propaganda on UTF-8 Encoding

By Xah Lee. Date: . Last updated: .

2022-10-25 This page is archived here for historical purposes. My opinion on UTF-8 on this page is completely wrong. UTF-8 is indeed far better than UTF-16.

This video is a piece of propaganda on UTF-8, spread by unix dweebs all over the web in 2013.

Characters, Symbols and the Unicode Miracle — Computerphile

Note the words: clever, hack, designed on a napkin. These are the words that tickle the hacker types, the tell-tale words of propaganda and garbage. The words of FUD of the unix fanatics.

The first 4 minutes 30 seconds are pure puerile drivel: verbiage on the background of ASCII and encoding.

The rest of the video extols UTF-8 as the miracle encoding. (If you don't know what Unicode and UTF-8 are, first see: Unicode Basics: Character Set, Encoding, UTF-8)

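The one thing i learned from this video is how UTF-8 indicates how many bytes a char takes. It's done with the leading bits of each byte: a 1-byte (ASCII) char starts with 0; in a multi-byte char, the first byte starts with as many 1 bits as the char has bytes, followed by a 0 (110, 1110, or 11110), and every following byte starts with 10. This is about the only technical point made by this video.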
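The byte-count rule can be sketched in a few lines of Python. In UTF-8, the leading bits of the first byte give the length: 0 means 1 byte, 110 means 2, 1110 means 3, 11110 means 4, and bytes starting with 10 are continuation bytes. The function name `utf8_sequence_length` and the sample characters are assumptions made for illustration. The sketch only inspects the bit pattern of the first byte and does not validate the rest of the sequence.

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes a UTF-8 char occupies, judged from its first byte.

    Only the leading bit pattern is checked; this is not a full validator.
    """
    if first_byte & 0b10000000 == 0b00000000:
        return 1  # 0xxxxxxx: plain ASCII
    if first_byte & 0b11100000 == 0b11000000:
        return 2  # 110xxxxx
    if first_byte & 0b11110000 == 0b11100000:
        return 3  # 1110xxxx
    if first_byte & 0b11111000 == 0b11110000:
        return 4  # 11110xxx
    raise ValueError("continuation byte (10xxxxxx) or invalid first byte")


# Sample chars (illustrative): A, e-acute, a Chinese char, an emoji.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in encoded)
    print(f"U+{ord(ch):04X}", bits, utf8_sequence_length(encoded[0]))
```

Printing the bits makes the pattern visible: the count of leading 1 bits in the first byte matches the number of bytes printed on that line.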

According to him, this UTF-8 design is the most beautiful. I don't see how it qualifies, because the 1st bit (the leftmost bit) is used for other purposes. Now you inject meaning into that 1st bit. That's not gonna be compatible with all the old stream protocols, unless you introduce another rule to indicate this. But then, it's no different than using a new encoding such as UTF-16.
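To see what that leftmost bit looks like in practice, here is a small Python sketch that lists the leftmost bit of every byte in a UTF-8 encoded string. The helper name `high_bits` and the sample strings are hypothetical, chosen only for illustration.

```python
def high_bits(text):
    """Return the leftmost bit (0 or 1) of every byte in the UTF-8 form of text."""
    return [byte >> 7 for byte in text.encode("utf-8")]


print(high_bits("abc"))       # ASCII-only text
print(high_bits("a\u4e2d"))   # 'a' followed by one Chinese char

# For ASCII-only text, the UTF-8 bytes and the ASCII bytes are the same.
print("abc".encode("utf-8") == "abc".encode("ascii"))
```

ASCII chars keep the leftmost bit at 0, while every byte of a multi-byte char has it set to 1, so any channel that carries UTF-8 beyond ASCII has to pass all 8 bits through untouched.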

in summary:

• UTF-8 is backward compatible with ASCII. This is great. However, as computer tech moves on, backward compatibility is not necessarily a good thing, as it is often an obstacle to progress.

• The guy doesn't consider Asia's needs and encodings. For example, Japan, China, and Taiwan have each used their own encodings, and still do today, without problems. China's GB 18030 is Unicode compatible (it has all chars of Unicode). They use these encodings happily, no problem. (see: Chinese Websites Character Encoding Survey, Year 2012)
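The GB 18030 point can be spot-checked with Python's built-in `gb18030` codec. This is only a sketch over a handful of sample chars picked for illustration, not a proof of full coverage, and the helper name `gb18030_bytes` is hypothetical.

```python
def gb18030_bytes(ch):
    """Encode ch as GB 18030 and confirm it decodes back to the same char."""
    encoded = ch.encode("gb18030")
    if encoded.decode("gb18030") != ch:
        raise ValueError(f"round trip failed for U+{ord(ch):04X}")
    return encoded


# Sample chars (illustrative): A, e-acute, a Chinese char, an emoji.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    gb = gb18030_bytes(ch)
    print(f"U+{ord(ch):04X}", gb.hex(), len(gb), "bytes")
```

Like UTF-8, GB 18030 is variable-length: ASCII stays at 1 byte, common Chinese chars take 2 bytes, and chars outside the Basic Multilingual Plane take 4.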

There is no absolute advantage of UTF-8 over UTF-16 or UTF-32. Each has merits. The point of this discussion is whether there is one encoding that's absolutely superior in general. The video gives an impression that it's UTF-8, but falls short in providing relevant evidence. Windows has used 16-bit Unicode internally since Windows NT (1993; first UCS-2, later UTF-16), as has Mac OS X's file system for file names (~2001), as has Java for its strings. And China's GB 18030 has been in use for over a decade. Also note, data storage capacity is increasing at an exponential rate. An image is a hundred times bigger than text, and videos, as at YouTube, are each hundreds of times bigger than an image. UTF-32 takes more storage and more memory, but is simpler to process. So, it's the classic optimization tradeoff. Do you want to optimize cpu usage or storage space? The video does not cover this at all.
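The storage side of this tradeoff is easy to measure. The following Python sketch compares the encoded size of a string in the three encodings. The function name `encoded_sizes` and the sample strings are assumptions for illustration, and the little-endian codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Sample strings (illustrative): English text and two Chinese chars.
for sample in ["hello", "\u4e2d\u6587"]:
    print(repr(sample), encoded_sizes(sample))
```

For ASCII-only text UTF-8 is the smallest (5 bytes for "hello", versus 10 and 20). For the two Chinese chars UTF-16 is smaller than UTF-8 (4 bytes versus 6), and UTF-32 always spends 4 bytes per char, which is what makes indexing by char position simple.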

Again, UTF-8 is loved by unixers/linuxers because all unix tools deal with ASCII only, and it is easy to patch them to deal with UTF-8. This is the main reason unixers love to spread this video, thinking that UTF-8 is superior to all.

google plus discussion https://plus.google.com/+XahLee/posts/YVCrF7VUgoy