Problems of Symbol Congestion in Computer Languages; ASCII Jam vs Unicode


The vast majority of computer languages use ASCII as their character set. This means they jam a multitude of operators into about 20 symbols. Often, a symbol has multiple meanings depending on context. Also, a sequence of chars is often used as a single symbol, as a workaround for the lack of symbols. Even languages that use Unicode as their char set (⁖ Java, XML) often still use the ≈20 ASCII symbols for all their operators. The only exceptions I know of are Mathematica, Fortress, APL. This page gives examples and problems of symbol congestion.

Mathematica (Wolfram Language). 〔☛ Math Typesetting, Mathematica, MathML〕 〔☛ In-place Algorithm for Reversing a List in Perl, Python, Lisp, Mathematica〕 〔☛ Mathematica vs Lisp Syntax〕

Symbol Congestion Examples

Multiple Meanings of a Symbol

Here are some common examples of a symbol that has multiple meanings depending on context:

In Java, the SQUARE BRACKET [ ] is used for declaring an array type main(String[] args). It is also part of the syntax for array creation myArray = new int[10];. It is also a delimiter for getting an element of an array myArray[i].

In Java and most other languages, PARENTHESIS ( ) is used for expression grouping (x + y) * z, also as delimiter for arguments of a function call System.out.print(x), also as delimiters for parameters of a function's declaration main(String[] args).

In Perl, C++, and others, COLON : is used as a separator in a ternary expression (⁖ (test ? "yes" : "no")), and also, doubled, as a namespace separator (⁖ use Data::Dumper;).

In a URL, SOLIDUS / is used as the path separator, and also as part of the marker that follows the protocol name. ⁖ http://example.org/comp/unicode.html

In Python and many others, LESS-THAN SIGN “<” is used for the “less than” boolean operator, but also as an alignment flag in its “format” method, also as a delimiter of a named group in regex, and also as part of other operators that are made of 2 or 3 chars, ⁖ {<< <= <<= <>} (the last one in Python 2 only).

The above are just some examples to illustrate the issue. There are perhaps 100 times more.
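
The Python case can be checked directly. The following is a minimal sketch (the variable names and sample values are made up for illustration) showing the same < character acting as a comparison operator, a format alignment flag, a regex named-group delimiter, and part of a shift operator:

```python
import re

# 1. "<" as the less-than comparison operator
is_less = 1 < 2

# 2. "<" as the left-align flag in the format mini-language
padded = "{:<5}|".format("ab")

# 3. "<" and ">" as delimiters of a named group in a regex
year = re.match(r"(?P<year>[0-9]+)", "2024-01").group("year")

# 4. "<" as part of the two-character left-shift operator
shifted = 1 << 3

print(is_less, repr(padded), year, shifted)
```

Only the surrounding context tells the reader (and the parser) which of the four meanings applies.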

Examples of Multi-Char Operators

Here are some common examples of operators that are made of multiple characters:
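
One way to see how common these are is to ask a language's own tokenizer. The sketch below uses Python's standard tokenize module to pull the multi-character operators out of a line of source. The helper name multi_char_ops and the sample line are assumptions made for this illustration.

```python
import io
import tokenize

def multi_char_ops(source):
    """Return the operator tokens in `source` that are longer than one character."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [t.string for t in tokens if t.type == tokenize.OP and len(t.string) > 1]

# The tokenizer only splits text into tokens; it does not check that the line is meaningful.
print(multi_char_ops("a <<= b ** c != d // e\n"))
```

Each of the operators it reports is a single symbol to the language, but is spelled with two or three ASCII chars.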

Problems of Symbol Congestion

The tradition of sticking to the 95 printable chars of 1960s ASCII is extremely limiting. It creates complex problems manifested in:

String Escape Mechanism

The string escape mechanism, for example C's backslash escapes {\n, \r, \t, \\, …}, is widely adopted. A better solution would be Unicode symbols for unprintable chars. Example candidates:

(Note: a string escape mechanism is ultimately necessary, but using proper Unicode can alleviate 99% of the need. 〔☛ Unicode: Character Representation, ASCII Character Symbols ␀ ␀ ␠ ␣ ¶ ↩ �〕)
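
Unicode already has such symbols in its Control Pictures block (U+2400 to U+2421). As a rough sketch of the idea, the following Python function (the name show_controls is made up for this example) displays a string with each ASCII control character replaced by its visible picture. It only illustrates the display side; it does not show how a language would accept such symbols inside string literals.

```python
def show_controls(text):
    """Replace ASCII control characters with their Unicode Control Pictures."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 0x20:
            # U+2400..U+241F are the pictures for U+0000..U+001F, in the same order
            out.append(chr(0x2400 + code))
        elif code == 0x7F:
            out.append("\u2421")  # SYMBOL FOR DELETE
        else:
            out.append(ch)
    return "".join(out)

print(show_controls("name\tvalue\r\n"))
```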

Crazy Leaning Toothpicks Syndrome

The backslash string escape mechanism directly leads to crazy leaning toothpicks syndrome, especially bad in Emacs regex. 〔☛ Emacs regex tutorial〕 Example:

"<img src=\"\\([^\"]+\\)\" alt=\"\\([^\"]+\\)\" width=\"\\([0-9]+\\)\" height=\"\\([0-9]+\\)\">"

Here's an example of a good design, where Unicode characters are used as meta-characters:

“<img src="〔「^"」+〕" alt="〔「^"」+〕" width="〔「0-9」+〕" height="〔「0-9」+〕">”
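
To make the idea concrete, here is a toy Python sketch that translates the bracket notation above into an ordinary regex. It rests on assumptions of mine, not on any existing tool: the four Unicode brackets are the only grouping and character-class meta-characters, and ^ + - keep their usual regex meaning. The function name to_python_regex and the sample tag are hypothetical.

```python
import re

# Assumed notation: 〔…〕 is a capture group, 「…」 is a character class.
BRACKETS = str.maketrans({"〔": "(", "〕": ")", "「": "[", "」": "]"})

def to_python_regex(pattern):
    """Translate the Unicode-bracket notation into Python regex syntax."""
    return pattern.translate(BRACKETS)

# No backslash is needed anywhere in the pattern.
pattern = '<img src="〔「^"」+〕" alt="〔「^"」+〕" width="〔「0-9」+〕" height="〔「0-9」+〕">'
regex = re.compile(to_python_regex(pattern))

m = regex.match('<img src="cat.png" alt="a cat" width="640" height="480">')
print(m.groups())
```

Because the meta-characters never collide with the ASCII chars being matched, the pattern needs no escaping at either the string level or the regex level.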

Confusing Context Sensitive Symbols

This is particularly bad in regex. For example, ^ has multiple meanings depending on where it is placed. At the beginning of a pattern, it's a line-beginning marker. As the first char inside square brackets ⁖ [^…], it's a negation. Elsewhere inside the brackets it's literal.

Many other regex chars also have special meanings, some depending on their position. ⁖ ^ $ ? | . + \ - { } ( ) [ ] ….

Whether a symbol's meaning is literal, or whether its position changes its meaning, or whether its meaning is changed inside [], is completely ad hoc.
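
The different readings of ^ are easy to reproduce. This short sketch uses Python's re module (other regex engines differ in details); the sample text and variable names are arbitrary:

```python
import re

text = "ab^c"

anchor = re.findall(r"^a", text)      # ^ at the start of the pattern: line-beginning anchor
negation = re.findall(r"[^a]", text)  # ^ first inside brackets: "any char except a"
literal = re.findall(r"[a^]", text)   # ^ elsewhere inside brackets: a literal caret
escaped = re.findall(r"\^", text)     # backslash-escaped: a literal caret

print(anchor, negation, literal, escaped)
```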

Complex Delimiters for Strings

The lack of bracketing symbols leads to a variety of unnecessarily complex string delimiters to help solve the problem of quoting.

Python's triple quotes: {'''…''', """…"""}. 〔☛ Python: Quoting Strings〕

Perl's varying delimiters: {q(…), q[…], q{…}, m/…/}. 〔☛ Perl: Quoting Strings〕

Perl, PHP, unix shell's “heredoc”. 〔☛ PHP: String Syntax ＆ Heredoc〕

〔☛ Computer Language Design: String Syntax〕
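
As a small illustration in Python (the variable names are arbitrary), here is one piece of text that contains both kinds of ASCII quote, written three ways. Which chars need a backslash depends entirely on which delimiter was picked:

```python
# The same text, He said "it's", written with three different delimiters.
escaped_double = "He said \"it's\""   # double-quoted: inner double quotes escaped
escaped_single = 'He said "it\'s"'    # single-quoted: inner apostrophe escaped
triple_quoted = '''He said "it's"'''  # triple-quoted: nothing escaped

print(escaped_double == escaped_single == triple_quoted)
```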

HTML Entities

HTML entities, ⁖ { &amp;, &lt;, &gt;, &quot;, &alpha;, &#945;, &#x3b1;, …}.

Example:

<p>4 > 3 is true</p>

is written as:

<p>4 &gt; 3 is true</p>

The HTML entities were invented partly as a mechanism for avoiding the symbol jam of the characters < > &, partly as a kludge for entering frequently needed symbols (⁖ © ™ α → …), and partly as a kludge to avoid char encoding and transmission problems (i.e. there was no Unicode in the 1980s, and only ASCII and a handful of other basic encodings were widely recognized). 〔☛ HTML/XML Entities List〕
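
Every tool that reads or writes HTML has to carry this translation layer. A brief sketch with Python's standard html module, using the sample strings from this section:

```python
import html

# Text to markup: the ">" must be replaced by an entity.
escaped = html.escape("4 > 3 is true")
print(escaped)

# Markup to text: three different spellings of the same Greek letter.
decoded = html.unescape("&alpha; &#945; &#x3b1;")
print(decoded)
```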

For a concrete example of how this induced complexity in code, see: ASCII Jam Problem: HTML Entities.

Ampersand in URL, URL Percent Encoding

URL percent encoding, and the encoding of Unicode in URLs, is another case. Example:

http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_%28D%C3%BCrer%29

for

http://en.wikipedia.org/wiki/Saint_Jerome_in_His_Study_(Dürer)

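There is also the complexity of resolving the ambiguity of the ampersand char in URLs and the CGI protocol, where & separates query parameters and so must itself be encoded when it appears inside a value.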
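
Both points can be reproduced with Python's standard urllib.parse module. In this sketch the page title comes from the example above, while the query parameters are invented for illustration:

```python
from urllib.parse import quote, unquote, urlencode, parse_qs

# Percent encoding of parentheses and of a non-ASCII letter (as UTF-8 bytes).
title = "Saint_Jerome_in_His_Study_(Dürer)"
encoded = quote(title)
print(encoded)
print(unquote(encoded) == title)

# An ampersand inside a value must be encoded, because a bare "&" separates parameters.
query = urlencode({"q": "Tom & Jerry", "page": "2"})
print(query)
print(parse_qs(query))
```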

Representation for: Unprintable Character, Their Input Methods, Keyboard Keys

When a language or config needs to represent keystrokes, the ASCII jam makes the complexity and readability much worse.

Fortress ＆ Unicode

All these problems occur because we are jamming so many meanings into about 20 symbols in ASCII.

Fortress computer language

The language designer Guy Steele recently gave a very interesting talk. See: Guy Steele on Parallel Programming. In it, he showed code snippets of his language Fortress, which uses Unicode characters as operators.

For example, list delimiters are the Unicode angle brackets ⟨ ⟩ ⁖ ⟨1,2,3⟩. List element extraction uses ⟦ ⟧. 〔☛ Matching Brackets in Unicode〕 It also uses the circled plus ⊕ as an operator. 〔☛ Math Symbols in Unicode〕

Many of today's languages have limited or no support for Unicode in function or variable names, so you often cannot use Unicode in variable names (⁖ α = 3) or function names (⁖ “lambda” as “λ” or “function” as “ƒ”), and almost none let you define your own operators (⁖ “⊕”). 〔☛ Unicode Support in Ruby, Perl, Python, JavaScript, Java, Emacs Lisp, Mathematica〕
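
Python 3 is one language where part of this does work: letters from any script are accepted in names, but math symbols are not, and new operators cannot be defined. A quick check (the names α and ƒ are just the examples from the paragraph above):

```python
# Letters from any script are valid identifier characters in Python 3.
α = 3
ƒ = lambda x: x * 2
print(ƒ(α))

# Math symbols such as ⊕ are not letters, so they cannot be used as names.
print("α".isidentifier(), "ƒ".isidentifier(), "⊕".isidentifier())
```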

The Problem of Typing Unicode?

Today, it's trivial to create a keyboard layout to type any set of Unicode symbols you choose. See: How to Create a APL or Math Symbols Keyboard Layout.
