Semantics of Symbols: HTML Entities, Ampersand, Unicode

By Xah Lee.

This article collects some thoughts on the semantics of symbols.

Semantic Differences in Symbols of Identical Appearance

In my computing tutorials, i use this scheme to indicate a menu path: “File ▸ New ▸ Folder”. (For example, see Second Life Keyboard Shortcuts Cheatsheet.)

The triangle Unicode char i was using is named TRIANGULAR BULLET ‣ (U+2023). I found a better symbol for it recently, the BLACK RIGHT-POINTING SMALL TRIANGLE ▸ (U+25B8).

Even though both chars look similar, they have semantic differences. The char i was using is meant to be a bullet. The purpose of a bullet symbol is to indicate an item in a list. What i needed isn't an item indicator, but an indicator for a node in a tree. A common symbol for this purpose is a right-pointing triangle. So, BLACK RIGHT-POINTING SMALL TRIANGLE is a better choice.
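The difference between the two look-alike chars can be inspected programmatically. Here is a minimal Python sketch using the standard `unicodedata` module. The constant names and the helper `describe` are made up for illustration.

```python
import unicodedata

# The two look-alike characters discussed above.
TRIANGULAR_BULLET = "\u2023"      # ‣
SMALL_RIGHT_TRIANGLE = "\u25B8"   # ▸

def describe(ch: str) -> str:
    """Return the code point, official Unicode name, and general category of ch."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} [{unicodedata.category(ch)}]"

if __name__ == "__main__":
    for ch in (TRIANGULAR_BULLET, SMALL_RIGHT_TRIANGLE):
        print(ch, describe(ch))
```

The official name is the most direct record of what a char is meant for. The general category shown in brackets is a further hint about whether Unicode treats a char as punctuation or as a symbol.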

Ampersand, HTML Entities

In HTML, the ampersand char & needs to be encoded as &amp;. In early HTML specs, that char always needs to be encoded, but i think in HTML4 the grammar is changed so that if the char is surrounded by spaces then it doesn't need to be encoded. (The rule is actually quite complex, especially when the char is in a URL. See: URL Percent Encoding and Ampersand Char.) In XML, the ampersand always needs to be encoded, unlike in HTML4.
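As a small illustration of the encoding step, here is a sketch using Python's standard `html` module. The function names `encode_text` and `decode_text` are hypothetical. The sketch takes the conservative route of always encoding the ampersand, which is safe for both HTML and XML text content.

```python
import html

def encode_text(text: str) -> str:
    """Escape &, < and > so the text can be placed in HTML or XML content."""
    # quote=False leaves quotation marks alone; they only matter inside attributes.
    return html.escape(text, quote=False)

def decode_text(markup: str) -> str:
    """Turn character entities such as &amp; back into plain characters."""
    return html.unescape(markup)

if __name__ == "__main__":
    encoded = encode_text("Bang & Olufsen")
    print(encoded)               # Bang &amp; Olufsen
    print(decode_text(encoded))  # Bang & Olufsen
```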

Meaning of the Ampersand

The ampersand char, as an English punctuation mark, means “and”. However, there are subtleties. (See the Wikipedia article on the ampersand; its etymology is quite an interesting story.) In my own writings, i sometimes use the symbol, as in the article title Scheme Lisp and Failure. In longer sentences, the symbol is sometimes used instead of “and” because the word “and” would introduce too many conjunctions in the sentence, while a glyph makes the grammatical structure clearer. For example, look at this sentence: “grep & glob mutates into egrep & fgrep confoundedness”. (from Unix and the mbox Email Format) Here, you can see the “&” acts as a tight connecting operator. Its meaning is slightly different from the more general connective “and”. For the same reason, company names stick with the symbol too, for example, “AT & T”, “Bang & Olufsen”, “Johnson & Johnson”, or the law firm “Baker & McKenzie”.

Recently, while working on the page Unicode: Punctuations • ✓ ™, i discovered several variant Unicode characters for the ampersand.

Being a fanatic about symbols, notation, syntax, and elegance, i really hate the entities in HTML. The need to encode & as &amp; introduces several complexities: the text is more difficult to parse, find & replace or grep becomes more complex, and typing is more difficult. For example, if you want to find all occurrences of & (surrounded by spaces), you now need to search for both & and &amp;.
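To make the search problem concrete, here is a minimal Python sketch of a pattern that has to cover both spellings. The names `AMP_EITHER_FORM` and `find_standalone_ampersands` are hypothetical, and “surrounded by spaces” is taken to mean whitespace on both sides.

```python
import re

# A standalone ampersand: whitespace on both sides, written either
# as a literal "&" or as the entity "&amp;".
AMP_EITHER_FORM = re.compile(r"(?<=\s)&(?:amp;)?(?=\s)")

def find_standalone_ampersands(source: str) -> list:
    """Return every standalone ampersand in source, in whichever form it appears."""
    return AMP_EITHER_FORM.findall(source)

if __name__ == "__main__":
    sample = "grep & glob, egrep &amp; fgrep, a &lt; b, page?x=1&y=2"
    print(find_standalone_ampersands(sample))
```

Other entities such as `&lt;` and the separator inside the URL query are not matched, because neither is an ampersand with whitespace on both sides.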

So, i've been toying with the idea of replacing any & with a Unicode variant, the FULLWIDTH AMPERSAND ＆ (U+FF06). This way, you don't have to deal with the HTML entity. The SGML/HTML/XML character entity was created because, some decades ago, Unicode wasn't there. You only had about 100 ASCII chars to work with. So “encoding” and “character entity” were invented. [see Problems of Symbol Congestion in Computer Languages; ASCII Jam vs Unicode]
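Here is a rough Python sketch of what such a replacement could look like. It is only one possible approach: the names `to_fullwidth` and `fold_back` are hypothetical, and only ampersands with whitespace on both sides are touched, so that other entities and URL query strings stay intact.

```python
import re
import unicodedata

FULLWIDTH_AMPERSAND = "\uFF06"  # ＆

# A standalone ampersand, either literal or already written as the entity.
STANDALONE_AMP = re.compile(r"(?<=\s)&(?:amp;)?(?=\s)")

def to_fullwidth(source: str) -> str:
    """Replace each standalone & or &amp; with FULLWIDTH AMPERSAND."""
    return STANDALONE_AMP.sub(FULLWIDTH_AMPERSAND, source)

def fold_back(text: str) -> str:
    """NFKC normalization folds the fullwidth form back to a plain ASCII &.
    Note that it also folds other compatibility characters in the text."""
    return unicodedata.normalize("NFKC", text)

if __name__ == "__main__":
    print(to_fullwidth("grep & glob mutates into egrep &amp; fgrep"))
    print(fold_back(FULLWIDTH_AMPERSAND))
```

One thing to keep in mind with this idea: the fullwidth form is a compatibility variant of the ASCII ampersand, so any software that applies NFKC normalization will treat the two as the same char.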

Here is an interesting video from Google about how search engines deal with less used Unicode chars.

How does Google handle ligatures, soft-hyphens, interpuncts and hyphenation points?
How does Google handle special characters? (Google Webmasters, Nov 1, 2010)
