Computer Language Design: String Syntax

By Xah Lee.

This article discusses string syntax in computer language design.

Problem with Escapes

Typically, the syntax for a string in a lang looks like this:

# perl
$mystr = "abc";

However, if your string contains double quotes, then you need to escape them (usually with a backslash), and this is ugly, hard to read, and inconvenient for programmers. For example, suppose your code processes a lot of HTML:

# perl
$mystr = "<link rel=\"stylesheet\" type=\"text/css\" href=\"../xyz.css\">";

Here's a practical example from real emacs lisp code. The backslash gets annoying.

(while (search-forward "<div class=\"d25935\">If you enjoyed this site, please donate! <a href=\"http://example.com/thanks.html\">Thanks!</a><form action=\"https://www.example.com/cgi-bin/webscr\" method=\"post\"><div><input type=\"hidden\" name=\"cmd\" value=\"_s-xclick\"><input type=\"hidden\" name=\"hosted_button_id\" value=\"1234567\"><input type=\"image\" src=\"https://www.example.com/en_US/i/btn/btn_donateCC_LG.gif\" name=\"submit\"><img alt=\"\" src=\"https://www.example.com/en_US/i/scr/pixel.gif\" width=\"1\" height=\"1\"></div></form></div>" nil t)

It's much worse with regex, especially emacs regex:

(while (search-forward-regexp
 "<a href=\"\\([^\"]+\\)\"><div class=\"img\"><img src=\"\\([^\"]+\\)\" alt=\"\\([^\"]+\\)\" width=\"\\([0-9]+\\)\" height=\"\\([0-9]+\\)\"></div></a>" nil t)
(replace-match "<div class=\"img\"><a href=\"\\1\"><img src=\"\\2\" alt=\"\\3\" width=\"\\4\" height=\"\\5\"></a></div>" t))

This is called leaning toothpick syndrome. (And if you think that's tolerable, have a look at the HTML source code of this page for that section.)

Other Forms of Escape: HTML Entities, Hex Code Literals

Another form of escape is HTML entities or hex code. For example, in HTML, the ampersand char can be written as &amp; or &#38; or &#x26;. The char “b” can be written as &#98; or &#x62;. Using so-called “entities” is necessary for the chars {<, >, &}.
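As a quick check of these equivalences, here is a minimal Python sketch using the standard library's html module. The sample strings and variable names are only illustrations.

```python
import html

# Three spellings of the ampersand, and two of the letter b.
amp_forms = ["&amp;", "&#38;", "&#x26;"]
b_forms = ["&#98;", "&#x62;"]

decoded_amp = [html.unescape(s) for s in amp_forms]
decoded_b = [html.unescape(s) for s in b_forms]

# Going the other way: html.escape replaces <, > and &
# (and, by default, quote chars too).
escaped = html.escape("<a & b>")

print(decoded_amp, decoded_b, escaped)
```

All the numeric and named forms decode to the same single char, which is why they count as escapes and not as different text.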

Similarly, in Java and many other langs, a hexadecimal code can be used. For example, “b” can be written as \u0062.
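Python accepts the same kind of escape inside a string literal. A small sketch, with made-up variable names, showing several spellings of the one-char string “b”:

```python
# Each of these spells the same one-character string "b" (code point U+0062).
forms = [
    "b",
    "\u0062",                    # 4-digit hex escape, same as the Java form
    "\x62",                      # 2-digit hex escape
    "\N{LATIN SMALL LETTER B}",  # escape by Unicode character name
    chr(0x62),                   # built from the code point at run time
]

all_same = all(s == "b" for s in forms)
print(all_same)
```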

Variable String Delimiters (Perl, PHP, Python)

One solution is to use different delimiters for the string. Perl and Python take this approach.

For example, in perl, the following all evaluate to the same string:

# perl
$x = "abc";
$x = 'abc';
$x = q[abc];
$x = q(abc);
$x = q{abc};

Basically, it allows different chars to be used as the string delimiter. This way, if your string contains ", you can switch to a different quoting delimiter, so you don't need the escapes, and your string is more readable. Here's how it looks, much clearer:

# perl
$mystr = q[<link rel="stylesheet" type="text/css" href="../xyz.css">];

[see Perl: Quoting Strings]

Python also has ways to avoid leaning toothpick syndrome. For example, the following lines all evaluate to the same string:

# python
x = "abc"
x = 'abc'
x = """abc"""
x = '''abc'''

[see Python: Quote String]
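Applied to the HTML example from earlier, a minimal Python sketch looks like this. The variable names are only for illustration.

```python
# The same HTML tag written four ways; only the first needs backslashes.
escaped = "<link rel=\"stylesheet\" type=\"text/css\" href=\"../xyz.css\">"
single = '<link rel="stylesheet" type="text/css" href="../xyz.css">'
triple_double = """<link rel="stylesheet" type="text/css" href="../xyz.css">"""
triple_single = '''<link rel="stylesheet" type="text/css" href="../xyz.css">'''

same = escaped == single == triple_double == triple_single
print(same)
```

Note that the triple double quote form works here only because the text does not end with a double quote char. The problem is reduced, not removed.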

Heredoc

Another solution, used by Perl and PHP and originating from the unix shell, is called “heredoc”. Basically, it uses an arbitrary string as the delimiter, and anything in between is literal. Here's an example.

# perl
$mystr = <<'randomstringhere823497';
<link rel="stylesheet" type="text/css" href="../xyz.css">
randomstringhere823497

[see PHP: String Syntax, Heredoc]
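To make the mechanism concrete, here is a rough Python sketch of how a reader could pick out a heredoc body. It is not how perl or PHP actually implement it. The function name read_heredoc is made up, and the sketch assumes a perl-style opener containing <<'TAG' and a terminator line that is exactly TAG.

```python
def read_heredoc(lines, start):
    """Return (body, end_index) for a heredoc opened on lines[start].

    Assumes a perl-style opener containing <<'TAG' and a terminator
    line that consists of TAG alone.
    """
    header = lines[start]
    tag_begin = header.index("<<'") + 3
    tag = header[tag_begin:header.index("'", tag_begin)]
    body = []
    for i in range(start + 1, len(lines)):
        if lines[i] == tag:
            # Terminator found: everything before it is literal text.
            return "".join(line + "\n" for line in body), i
        body.append(lines[i])
    raise ValueError("unterminated heredoc, expected " + tag)


source = [
    "$mystr = <<'randomstringhere823497';",
    '<link rel="stylesheet" type="text/css" href="../xyz.css">',
    "randomstringhere823497",
]
body, end = read_heredoc(source, 0)
print(repr(body))
```

The body is copied without looking at its chars at all. The only thing the reader checks is whether a whole line equals the tag.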

Can Escape be Completely Avoided?

Ron Garret wrote:

And just for good measure, some «European style quotes» and “balanced smart quotes” which I intend some day to try to convince people to start using to eliminate the scourge of backslash escapes. But that's a topic for another day.

Spiros Bousbouras 〔spi…@gmail.com〕 wrote:

I don't see how they would help to eliminate backslash escapes. Let's imagine that strings were delimited by « and ». If you wanted a string which contained a » you would still need to escape it.

Using the rich variety of matching pair chars in Unicode can eliminate many escapes and improve code readability. [see Matching Brackets in Unicode] Compare the following two lines of elisp code (the second uses 「」 as a string delimiter, which emacs lisp does not actually support):

(insert "<span class=\"ref\"><a href=\"" URL "\">" swd "</a></span>")
(insert 「<span class="ref"><a href="」 URL 「">」 swd 「</a></span>」)

Ultimately, escapes cannot be completely eliminated, no matter how many variations of delimiters your language has (unless the set is infinite, as with “heredoc”, or you use non-syntactic methods). This is because, if your lang is a general-purpose lang, inevitably it'll be used to parse its own source code. And there will be occasions when the text you want to quote is a complete enumeration of all possible string delimiters of your lang. (For example, a tutorial of language X in HTML containing examples of X and processed by X.) In that case, no matter which delimiter you choose, it occurs in the string you want to quote.
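The reason an infinite set of delimiters escapes this trap can be shown with a small Python sketch. The function name choose_heredoc_tag and the base tag “EOT” are made up for illustration. Since any text has finitely many lines, the search below always ends.

```python
import itertools


def choose_heredoc_tag(text, base="EOT"):
    """Return a tag that does not occur as a whole line of text."""
    lines = set(text.split("\n"))
    for n in itertools.count():
        # Try EOT, EOT1, EOT2, ... until one is not a line of the text.
        tag = base if n == 0 else base + str(n)
        if tag not in lines:
            return tag


# A text that already contains the first two candidate tags as lines,
# plus a mix of the usual fixed delimiters.
tricky = "EOT\nEOT1\nsome text with \" and ' and ] and } in it"
tag = choose_heredoc_tag(tricky)
print(tag)
```

With a fixed, finite set of delimiters, no such search is guaranteed to succeed.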

“heredoc” is an ugly solution to this. Another possible solution is variable repetition. For example, let any repetition of a matching pair delimiter also be valid syntax: (((abc))), {{abc}}, 「「「「abc」」」」, etc., or any combination of repetitions of variable string delimiters, for example, ([((【《『‹abc›』》】))]).

(Note: here, the desired property is the ability to quote a text without modifying the text in any way. So, this excludes adding any form of escapes, or inventive ways such as adding a Tab char in front of each line. Also, I am thinking in the context of computer language syntax. This excludes semantic solutions that specify how many chars/lines to read in and use no delimiters. Thanks to the reddit discussion at http://www.reddit.com/r/programming/comments/fux1s/computer_language_design_string_syntax/ .)

Disadvantage of Variable String Delimiters

The variable quoting chars also introduce some complexity. Namely, each delimiter symbol in your lang now has multiple, context-dependent meanings, and/or you have multiple symbols for the same semantics. [see Problems of Symbol Congestion in Computer Languages; ASCII Jam vs Unicode]

For example, one language that does not have multiple string delimiters is emacs lisp (or lisps in general). In emacs lisp, a string is always delimited by the double straight quote ("). Emacs lisp has the worst readability problems of leaning toothpick syndrome. However, one advantage is that string syntax has a very simple logic. For example, you can ALWAYS locate ALL strings in the source code by searching for the double straight quote char. In langs with variable quotes such as perl, this is no longer true. You have to search for several chars, and for each occurrence you have to judge based on adjacent chars.
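As a rough illustration of how simple that logic is, here is a Python sketch that finds double-quoted strings with one regex. The names STRING_RE and find_strings are made up. The sketch assumes the only escape is a backslash followed by one char, and it ignores comments and character literals such as ?\", which a real elisp reader would have to handle.

```python
import re

# An opening double quote, then any run of chars that are either
# ordinary (not a quote, not a backslash) or a backslash plus one char,
# then the closing double quote.
STRING_RE = re.compile(r'"(?:[^"\\]|\\.)*"', re.DOTALL)


def find_strings(source):
    """Return every double-quoted string literal found in source."""
    return STRING_RE.findall(source)


# A line of elisp-like code, similar to the insert example in this article.
code = r'(insert "<a href=\"" url "\">" name "</a>")'
found = find_strings(code)
for s in found:
    print(s)
```

For a lang with many delimiter forms, the equivalent search needs one pattern per form plus context checks around each match.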

Similarly, in Mathematica, the paren is used for one single purpose only, always. It's the delimiter for specifying evaluation order of expressions (for example, (3+4)*2). The square bracket [] has one single purpose only. It's the delimiter for function arguments, for example: f[x_]:= x + 1, f[3]. The curly bracket {} again has one single purpose only. It's the delimiter for lists. For example, {1,2}. In traditional math notation and most comp langs, it's all context-dependent soup.

Whatever your philosophy in lang design with regard to quoting mechanisms, Unicode introduces many proper matching pairs that are helpful, and that avoid multiple semantic meanings for a given char.

(This essay was originally a post in an online forum discussion at groups.google.com.)
