
Semantic of Symbols: HTML Entities, Ampersand, Unicode

Xah Lee

This article collects some thoughts on the semantics of symbols.

Semantic Differences in Symbols of Identical Appearance

I write a lot of essays and tutorials related to computing. Often, they include instructions on pulling down menus in software. For example, i would write: use the menu 〖File ▸ New ▸ Folder〗. (⁖ Second Life Keyboard Shortcuts Cheatsheet.)

I needed a consistent syntax to indicate the menu hierarchy. Notice that i've used a small right-pointing triangle there. The Unicode char i was using is named TRIANGULAR BULLET ‣ (U+2023). I found a better symbol for it recently, the BLACK RIGHT-POINTING SMALL TRIANGLE ▸ (U+25B8). So, i took 5 min in emacs to make the change across the 5k files on my site. (There were 353 occurrences in 62 files.)

Even though both chars look the same, they have semantic differences. The char i was using is meant to be a bullet, whose purpose is to indicate an item in a list. What i needed isn't an item indicator, but an indicator for a node in a tree. A common symbol for this purpose is a right-pointing triangle. So, BLACK RIGHT-POINTING SMALL TRIANGLE is a better choice.
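The difference between the two look-alike chars shows up in their Unicode names and code points. Here is a minimal Python sketch using the standard `unicodedata` module. The helper names `describe` and `swap_bullet` and the sample string are made up for illustration, and this is not the emacs command used for the actual change.

```python
import unicodedata

TRIANGULAR_BULLET = "\u2023"   # ‣
SMALL_TRIANGLE = "\u25B8"      # ▸

def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

def swap_bullet(text):
    """Hypothetical helper: replace every triangular bullet with the small triangle."""
    return text.replace(TRIANGULAR_BULLET, SMALL_TRIANGLE)

print(describe(TRIANGULAR_BULLET))
print(describe(SMALL_TRIANGLE))
print(swap_bullet("〖File \u2023 New \u2023 Folder〗"))
```

The two chars may render identically in many fonts, but a program sees them as unrelated code points, which is why a plain string replacement is enough to tell them apart.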

Ampersand, HTML Entities

On a related topic… you know how in HTML, the ampersand char & needs to be encoded as &amp;. In early HTML specs, that char always needs to be encoded, but i think in HTML4 the grammar was changed so that if the char is surrounded by spaces then it doesn't need to be encoded. (The rule is actually quite complex, especially when the char is in a URL. See: URL Percent Encoding and Ampersand Char.) In XML, the ampersand always needs to be encoded, unlike in HTML4.
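As a small illustration of the encoding step, here is a sketch using Python's standard `html` module. The sample string is just an example, and `html.escape` always encodes the ampersand, so it does not model the more permissive HTML4 rule mentioned above.

```python
import html

raw = "Scheme & Failure"

encoded = html.escape(raw)         # the ampersand becomes the entity &amp;
decoded = html.unescape(encoded)   # the entity becomes a plain ampersand again

print(encoded)
print(decoded)

# Escaping text that is already encoded encodes the ampersand a second time.
print(html.escape(encoded))
```

The last line shows one of the complications: a tool has to know whether its input is already encoded, or it will encode it twice.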

Meaning of the Ampersand

The ampersand char as an English punctuation mark means “and”; however, there are subtleties. (See the Wikipedia article on the ampersand; the etymology is quite an interesting story.) In my own writings, i sometimes use the symbol, as in the article title Scheme & Failure. In longer sentences, it is sometimes used instead of “and” because the word “and” would introduce too many conjunctions into the sentence, while a glyph makes the grammatical structure clearer. For example, look at this sentence: “grep & glob mutates into egrep & fgrep confoundedness”. (Unix and the mbox Email Format) Here, you can see the “&” acts as a tight connecting operator. Its meaning is slightly different from the more general connective “and”. For the same reason, company names stick with the symbol too, ⁖ “AT & T”, “Bang & Olufsen”, “Johnson & Johnson”, or law firms such as “Baker & McKenzie”.

With my recent Unicode work (see: Punctuation Symbols in Unicode), i discovered several variant Unicode chars for the ampersand.

Being a fanatic about symbols, notation, syntax, and elegance, i really hate the entities in HTML. The need to encode & as &amp; introduces several complexities. It makes the text more difficult to parse, makes find & replace or grep more complex, and makes typing more difficult. For example, if you want to find all occurrences of & (surrounded by spaces), you now need to search for both & and &amp;.
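To make the search problem concrete, here is a minimal Python sketch of a pattern that has to cover both the raw and the entity form. The name `count_ampersands` and the sample text are hypothetical, chosen only for illustration.

```python
import re

# A space-delimited ampersand, written either raw or as the entity.
AMP = re.compile(r" (?:&amp;|&) ")

def count_ampersands(source):
    """Count space-delimited ampersands in either form."""
    return len(AMP.findall(source))

sample = "grep & glob, Bang &amp; Olufsen, AT&T"
print(count_ampersands(sample))
```

A plain search for " & " would miss the encoded occurrence, so every search and every replacement has to carry the alternation.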

So, i've been toying with the idea of replacing any & with a Unicode variant, the FULLWIDTH AMPERSAND ＆ (U+FF06). This way, you don't have to deal with the HTML entity. The SGML/HTML/XML character entity was created because, some decades ago, Unicode wasn't there. You only had about 100 ASCII chars to work with. So “encoding” and the “character entity” were invented.
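The replacement idea can be sketched in Python as follows. The helper `to_fullwidth` is hypothetical and only handles the space-delimited case. The last line relies on the fact that Unicode defines a compatibility mapping from the fullwidth form to the ASCII ampersand, so software that applies NFKC normalization will fold the variant back.

```python
import re
import unicodedata

FULLWIDTH_AMPERSAND = "\uFF06"  # ＆

def to_fullwidth(source):
    """Hypothetical helper: swap a space-delimited & or &amp; for the fullwidth variant."""
    return re.sub(r" (?:&amp;|&) ", f" {FULLWIDTH_AMPERSAND} ", source)

converted = to_fullwidth("Scheme &amp; Failure")
print(converted)
print(unicodedata.name(FULLWIDTH_AMPERSAND))

# NFKC normalization maps the fullwidth form back to the ASCII ampersand.
print(unicodedata.normalize("NFKC", converted))
```

Whether that folding helps or hurts depends on the tool: it keeps the variant searchable as a plain &, but it also means the distinction may not survive every processing step.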

Here's an interesting video from Google about how search engines deal with less-used Unicode chars.

“How does Google handle ligatures, soft-hyphens, interpuncts and hyphenation points?”

See also: Problems of Symbol Congestion in Computer Languages; ASCII Jam vs Unicode.

