
Problems of Symbol Congestion in Computer Languages; ASCII Jam vs Unicode

Xah Lee

The vast majority of computer languages use ASCII as their character set. This means they jam a multitude of operators into about 20 symbols. Often a symbol has multiple meanings depending on context, and a sequence of chars is used as a single symbol to work around the lack of symbols. Even languages that use Unicode as their char set (⁖ Java, XML) often still use the ≈20 ASCII symbols for all their operators. The only exceptions I know of are Mathematica, Fortress, and APL. This page gives examples and problems of symbol congestion.

Symbol Congestion Examples

Multiple Meanings of a Symbol

Here are some common examples of a symbol that has multiple meanings depending on context:

In Java, the SQUARE BRACKET [ ] is used for declaring an array type main(String[] args). It is also part of the syntax for array creation myArray = new int[10];, and a delimiter for getting an element of an array myArray[i].

In Java and most other languages, PARENTHESIS ( ) is used for expression grouping (x + y) * z, as a delimiter for the arguments of a function call System.out.print(x), and as a delimiter for the parameters of a function's declaration main(String[] args).

In {C, Perl} and many other langs, COLON : is used as a separator in a ternary expression (⁖ (test ? "yes" : "no")). In Perl, doubled, it is also a namespace separator (⁖ use Data::Dumper;).

In URL, SOLIDUS / is used as path separator, also as indicator of protocol. ⁖ http://example.org/comp/unicode.html

In Python and many others, LESS-THAN SIGN “<” is used as the “less than” comparison operator, but also as an alignment flag in the “format” method, as a delimiter of a named group in regex, and as part of other operators made of 2 or more chars, ⁖ {<< <= <<= <>}. (The <> operator exists only in Python 2.)
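To see the overloading concretely, here is a minimal Python 3 sketch that uses “<” in each of the roles just listed. The variable names and sample values are made up for illustration.

```python
import re

# 1. Comparison operator.
is_less = 3 < 4

# 2. Alignment flag in the format mini-language:
#    left-align the text in a field 6 chars wide.
padded = "{:<6}".format("ab")

# 3. Delimiter of a named group in a regex.
match = re.search(r"(?P<year>[0-9]{4})", "released in 2011")
year = match.group("year")

# 4. Part of operators made of several chars.
shifted = 1 << 3   # left shift
at_most = 3 <= 3   # less than or equal
n = 1
n <<= 2            # in-place left shift
```

Same char in every case; only the surrounding context tells the reader (and the parser) which meaning applies.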

The above are just some examples to illustrate the issue. There are perhaps 100 times more.

Examples of Multi-Char Operators

Here are some common examples of operators that are made of multiple characters:
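As one illustration, assuming Python as the example language, the standard tokenize module reports such char sequences as single operator tokens. The helper name operator_tokens and the sample expression are made up for this sketch.

```python
import io
import tokenize

def operator_tokens(source):
    """Return the operator tokens found in a piece of Python source."""
    readline = io.StringIO(source).readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.OP]

# Each multi-char sequence below comes back as ONE token.
ops = operator_tokens("x **= y >> 2 if a != b else a // b\n")
print(ops)
```

The tokenizer has to glue 2 or 3 ASCII chars together to recover what is logically one symbol.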

Problems of Symbol Congestion

The tradition of sticking to the 95 printable chars of 1960s ASCII is extremely limiting. It creates complex problems, manifested in:

String Escape Mechanism

String escape mechanisms, for example C's backslash {\n, \r, \t, \\, …}, are widely adopted. A better solution would be Unicode symbols for unprintable chars. Example candidates:

(Note: string escape mechanism is ultimately necessary, but using proper Unicode can alleviate 99% of the need. (See also: Computing Symbols in Unicode.))
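One possible set of candidates, as an assumption on my part, is the Unicode “Control Pictures” block (U+2400 to U+243F), which has a visible symbol for each ASCII control char. The sketch below maps three control chars to their pictures; the names CONTROL_PICTURES and show_controls are made up for illustration.

```python
# Map a few ASCII control chars to their Unicode Control Pictures.
CONTROL_PICTURES = {
    "\t": "\u2409",  # ␉ SYMBOL FOR HORIZONTAL TABULATION
    "\n": "\u240A",  # ␊ SYMBOL FOR LINE FEED
    "\r": "\u240D",  # ␍ SYMBOL FOR CARRIAGE RETURN
}

def show_controls(text):
    """Replace tab, line feed, and carriage return with visible symbols."""
    return text.translate(str.maketrans(CONTROL_PICTURES))

visible = show_controls("a\tb\r\n")
print(visible)
```

This only shows the display direction. A language that accepted such symbols inside string literals would still need some escape for the rare case where the symbol itself is meant.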

Crazy Leaning Toothpicks Syndrome

The backslash string escape mechanism leads directly to crazy leaning toothpicks syndrome, which is especially bad in Emacs regex. Example:

"<img src=\"\\([^\"]+\\)\" alt=\"\\([^\"]+\\)\" width=\"\\([0-9]+\\)\" height=\"\\([0-9]+\\)\">"

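For comparison, here is roughly the same pattern written for Python's re module, as a sketch. Python is my choice here, not part of the example above. A raw string removes the doubled backslashes, and single-quote delimiters remove the need to escape the double quotes. The sample tag and variable names are made up.

```python
import re

# Raw string + single-quote delimiters: no backslashes are needed at all,
# because Python's regex syntax uses plain ( ) for groups.
IMG_PATTERN = re.compile(
    r'<img src="([^"]+)" alt="([^"]+)" width="([0-9]+)" height="([0-9]+)">'
)

tag = '<img src="cat.png" alt="a cat" width="640" height="480">'
src, alt, width, height = IMG_PATTERN.search(tag).groups()
```

This is still a workaround. It trades one jam for another: the pattern now cannot easily contain a single quote.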
Confusing Context Sensitive Symbols

This is particularly bad in regex. For example, ^ has multiple meanings depending on where it is placed. At the beginning of a pattern, it's a line-beginning marker; as the first char inside square brackets (⁖ [^…]), it's a negation; otherwise it's literal.

Many other regex chars also have special meanings, and some depend on their position. ⁖ ^ $ ? | . + \ - { } ( ) [ ] ….

Whether a symbol's meaning is literal, whether its position changes the meaning, and whether the meaning changes inside [] are all completely ad hoc.
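A small sketch of the three meanings of ^, using Python's re module (details differ a bit between regex flavors; the sample strings are made up):

```python
import re

text = "a^b\nab"

# 1. At the start of the pattern, ^ anchors to a line beginning
#    (every line, with re.MULTILINE).
line_starts = re.findall(r"^a", text, flags=re.MULTILINE)

# 2. As the first char inside [...], ^ negates the set: "any char except a".
not_a = re.findall(r"[^a]", "a^b")

# 3. Elsewhere inside [...], ^ is a literal caret: "a or ^".
a_or_caret = re.findall(r"[a^]", "a^b")
```

Moving the caret by one position inside the brackets flips the pattern from “anything but a” to “a or caret”.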

Complex Delimiters for Strings

The lack of bracketing symbols leads to varieties of unnecessarily complex string delimiters to help solve the problem of quoting.

Python's triple quotes: {'''…''', """…"""}. 〔☛ Strings in Perl & Python〕

Perl's varying delimiters: {q(…), q[…], q{…}, m/…/}.

Perl, PHP, unix shell's heredoc. 〔☛ PHP: String Syntax & Heredoc〕

(See also: Computer Language Design: String Syntax.)
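A short Python sketch of why such delimiters exist: the same text, once with backslash escapes and once with triple quotes. The sample sentences are made up.

```python
# With ordinary double-quote delimiters, inner double quotes need escaping.
escaped = "she said \"it's done\""

# With triple single quotes, both quote chars can appear unescaped.
triple = '''she said "it's done"'''

# Triple quotes also allow literal line breaks inside the string.
block = """line one
line "two" isn't escaped"""
```

Both forms produce the same string. The triple quote is just a longer ASCII sequence standing in for a bracket pair the char set doesn't have.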

HTML Entities

HTML entities, ⁖ { &amp;, &lt;, &gt;, &quot;, &alpha;, &#945;, &#x3b1;, …}.

Example. This:

<p>he wrote “4 > 3”</p>

is written as:

<p>he wrote &ldquo;4 &gt; 3&rdquo;</p>

HTML entities were invented partly as a mechanism for avoiding the symbol jam of the characters < > &, partly as a kludge for entering frequently needed symbols (⁖ © ™ α → …), and partly as a kludge to avoid char encoding and transmission problems (i.e. Unicode did not yet exist in the 1980s, and only ASCII and a handful of other basic encodings were widely recognized). 〔☛ HTML/XML Entities List〕

For a concrete example of how this induced complexity in code, see: ASCII Jam Problem: HTML Entities.
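The round trip can be sketched with Python's standard html module (my choice of tool for illustration). Note that only the chars that collide with markup have to be escaped; the curly quotes can stay as they are when the page encoding is Unicode.

```python
import html

raw = "he wrote \u201c4 > 3\u201d"   # he wrote “4 > 3”

# Escaping touches only the markup chars (& < > and straight quotes).
escaped = html.escape(raw)

# Unescaping understands named, decimal, and hex entities alike.
decoded = html.unescape("he wrote &ldquo;4 &gt; 3&rdquo;")
alphas = html.unescape("&alpha; &#945; &#x3b1;")
```

Three different spellings for one alpha is the kind of complexity the jam creates.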

Ampersand in URL, URL Percent Encoding

URL percent encoding and encoding Unicode in URL. Example:

http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_%28D%C3%BCrer%29

for

http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(Dürer)

There is also the complexity of resolving the ambiguity of the ampersand char in URLs and the CGI protocol. See: URL Percent Encoding and Ampersand Char; URL Percent Encoding problems; JavaScript Encode URL, Escape String.
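Both issues can be sketched with Python's standard urllib.parse module (my choice of tool for illustration). The query parameters q and lang are made up.

```python
from urllib.parse import quote, unquote, urlencode, parse_qs

# Percent encoding: parentheses and the UTF-8 bytes of ü get escaped.
title = "St._Jerome_in_His_Study_(Dürer)"
encoded = quote(title)
decoded = unquote(encoded)

# Ampersand ambiguity: & separates parameters, so an ampersand that
# belongs to a value must itself be percent encoded as %26.
query = urlencode({"q": "AT&T", "lang": "en"})
params = parse_qs(query)
```

Here encoded matches the path segment in the Wikipedia URL above, and the ampersand inside the value survives the round trip only because it was turned into %26 first.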

Representation for: Unprintable Character, Their Input Methods, Keyboard Keys

When a language or config needs to represent keystrokes, the ASCII jam makes the complexity and readability problems much worse.

Fortress & Unicode

All these problems occur because we are jamming so many meanings into about 20 symbols in ASCII.

The language designer Guy Steele recently gave a very interesting talk. See: Guy Steele on Parallel Programing. In it, he showed code snippets of his language Fortress, which uses Unicode as operators.

For example, list delimiters are Unicode angle brackets ⟨1,2,3⟩. 〔☛ Matching Brackets in Unicode〕 It also uses the circle plus “⊕” as an operator. 〔☛ Math Symbols in Unicode〕

Many of today's languages have limited or no support for Unicode in function or variable names, so you often cannot use Unicode in variable names (⁖ α = 3) or function names (⁖ “lambda” as “λ” or “function” as “ƒ”), and very few let you define your own operators (⁖ “⊕”). 〔☛ Unicode Support in Ruby, Perl, Python, JavaScript, Java, Emacs Lisp, Mathematica〕
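Support varies by language and version. As one data point, this sketch assumes Python 3, which accepts Unicode letters in names but gives no way to define a new operator symbol. The helper name accepts is made up for illustration.

```python
α = 3                  # Greek letter as a variable name


def ƒ(x):              # U+0192 as a function name
    return x + α


def accepts(source):
    """Return True if Python can compile the expression, else False."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


names_ok = accepts("α + 1")       # letters are fine in names
operator_ok = accepts("1 ⊕ 2")    # a math symbol is not a valid operator
```

So names like α and ƒ work here, but “⊕” as an operator is rejected at compile time.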

The Problem of Typing Unicode?

Today, it's trivial to create a keyboard layout to type any set of Unicode symbols you choose. See: How to Create a APL or Math Symbols Keyboard Layout.
