XML is not S-Expressions (By Paul Prescod) (2003)

By Paul Prescod, 2003-04-15

There exists a persistent meme that XML is just a newfangled, verbose form of s-expressions. For [1] example, “XML is little more than a notation for trees and for tree grammars, a verbose variant of Lisp S-expressions coupled with a poor man's BNF (Backus-Naur form)” [2] and “a poor copy of sexprs.”

1. ~~http://www.research.avayalabs.com/user/wadler/xml/~~
2. http://c2.com/cgi/wiki/wiki?edit=XmlSucks

These people do not appreciate that s-expressions were simply not designed to solve the same problems XML was designed to solve, and are not properly engineered to do so.

Starting with Syntax

The first thing to recognize is that syntax matters. If it didn't, we would all be using binary formats. Most Lisp people agree but feel that their syntax is better than XML's. Let's compare two idiomatic examples:

(document author: "paul@prescod.net"
  (para "This is a paragraph " (footnote "(better than the one under there)") ".")
  (para "Ha! I made you say \"underwear\"."))

<document author="paul@prescod.net">
<para>This is a paragraph <footnote>(just a little one)</footnote>.</para>
<para>Ha! I made you say "underwear".</para>
</document>

There is no standard syntax for attributes across Lisp-like languages, so I've chosen to use the syntax from DSSSL.

The XML version is obviously a little more verbose. It has one extra line and if you count characters you'll see it loses there too. To me, the XML version looks easier to read, but I look at XML all day. Nevertheless, you can imagine which would be preferable to a standard HTML hacker. Remember that XHTML (based on XML) is intended to be the replacement for HTML.

Beyond this familiarity, the XML version has several advantages as a document processing technology:

The XML one defaults to treating random characters as text, not as markup. For instance, after the “footnote” tag, the s-expression version requires you to wrap the trailing period in quotation marks. In the XML, anywhere within an element, any text is presumed to be text. There are a [3] couple of [4] projects that [5] adapt S-expression syntax to markup-related problems and this is the primary problem they solve. The only exception I have found is [6] LAML, wherein the author suggests that you use an intelligent editor that can work around the problem.
The XML one does not use standard human-punctuation characters as markup. It doesn't require quoting of apostrophes, double quotes or parentheses. The only characters XML cares about are angle-brackets and ampersands. The need to escape does not arise with most prose documents.
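
The difference in escaping burden can be sketched in a few lines of Python. This is only an illustration: `escape` is the standard library's XML text escaper, while `sexpr_string` is a hypothetical helper that assumes Scheme-style string literals (backslash-escaped backslashes and double quotes).

```python
from xml.sax.saxutils import escape


def sexpr_string(text: str) -> str:
    """Quote text as an S-expression string literal (hypothetical helper).

    Assumes Scheme-style strings: backslash and double quote are escaped.
    """
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


prose = 'Ha! I made you say "underwear" (it\'s true).'

# XML character data: only &, < and > are rewritten by escape().
xml_text = escape(prose)

# S-expression: the whole run is wrapped in quotes and inner quotes are escaped.
sexpr_text = sexpr_string(prose)

print(xml_text)
print(sexpr_text)
```

For ordinary prose the XML form comes back unchanged, whereas the S-expression form always gains delimiters and usually some backslashes.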

3. ~~http://www-sop.inria.fr/mimosa/fp/Scribe/~~
4. http://nuclight.ipfw.ru/vadim/ProgLanguageComparison/xml-sexprs.html
5. http://brl.sourceforge.net/brl_toc.html
6. ~~http://www.cs.auc.dk/~normark/scheme/distribution/laml/info/laml-motivation.html~~

Redundancy Is Good

I often hear that XML is like S-expressions except that it is more verbose because the inventors of XML “didn't realize” that it is possible to elide the element type name from the end-tag. This is not true: there was vigorous debate on the requirement for named end-tags and the current design won out.

The XML version is verbose but it is also more robust in the face of errors. For instance, I'll make the same mistake in both documents:

(document author: "paul@prescod.net"
  (para "This is a paragraph " (footnote "(better than the one under there)" ".")
  (para "Ha! I made you say \"underwear\"."))

<document author="paul@prescod.net">
<para>This is a paragraph <footnote>(just a little one).</para>
<para>Ha! I made you say "underwear".</para>
</document>

The error is the missing footnote end-tag. It might have taken you a minute to find it but now that you've found it you can probably see easily where the end-tag should go.

Now imagine that you are a program required to find syntactic (“well-formedness”) errors and report them to the user. In the XML one, it is possible to notice that the footnote should have ended before the para ended. In the S-expression version, the software does not know that there is a missing parenthesis until it gets to the end of the document. In a sufficiently long document, this bad error checking could cause major problems. Even where there are no errors, it can be tricky in less powerful text editors to know which parentheses match each other.
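
A rough way to see this difference is to feed both broken documents to a checker and ask where the error is first noticed. The sketch below uses Python's standard `xml.parsers.expat` for the XML; `first_sexpr_error_line` is a hypothetical, minimal parenthesis matcher written for this illustration, not a real Lisp reader.

```python
import xml.parsers.expat

BROKEN_XML = """<document author="paul@prescod.net">
<para>This is a paragraph <footnote>(just a little one).</para>
<para>Ha! I made you say "underwear".</para>
</document>"""

BROKEN_SEXPR = r'''(document author: "paul@prescod.net"
  (para "This is a paragraph " (footnote "(better than the one under there)" ".")
  (para "Ha! I made you say \"underwear\"."))'''


def xml_error_line(text):
    """Return the line where expat reports a well-formedness error, or None."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as err:
        return err.lineno
    return None


def first_sexpr_error_line(text):
    """Return the line where unbalanced parentheses are noticed, or None."""
    depth, line, in_string, escaped = 0, 1, False, False
    for ch in text:
        if ch == "\n":
            line += 1
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return line  # stray closing parenthesis
    # A missing closing parenthesis can only be noticed at end of input.
    return line if depth != 0 or in_string else None


xml_line = xml_error_line(BROKEN_XML)
sexpr_line = first_sexpr_error_line(BROKEN_SEXPR)
print("XML error noticed on line", xml_line)
print("S-expression error noticed on line", sexpr_line)
```

The XML parser complains on the line containing the mismatched `</para>`, while the parenthesis matcher has nothing to report until it runs out of input on the last line. In a three-line example the gap is small, but it grows with the length of the document.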

These are both well-documented problems with Lisp. Smart editing tools can help but they cannot solve the problem. Allowing multiple delimiter characters can help a little bit but it also makes the text less regular and harder to read. It is important to remember that there have been attempts to apply S-expressions to documentation and they have failed to gain popularity. I believe that these are the reasons they have failed.

For example, the canonical documentation of the Scheme and Lisp standards is maintained not in S-expression syntax but in LaTeX syntax. If S-expressions were easier to edit, it would be most logical to edit the document in S-expressions and then write a small Scheme program to convert S-expressions into a formatting language like LaTeX. This is what XML and SGML people have done for decades, because they really do believe that their technologies are better for document editing and maintenance than LaTeX. The Lisp world seems to have come to a different conclusion about S-expressions versus LaTeX.

By the way, LaTeX (the language used for much Lisp-world documentation) does put tag-names in end-tags (at least for large blocks of content) just as XML does. So do most, if not all, other markup languages, whether from the SGML family or not.

Family Matters

So in my opinion, XML's syntax is wildly better than S-expressions as a language to integrate the worlds of documentation and data. But this says nothing yet about semantics. XML was always envisioned as a member of a family of related standards:

DTDs, RELAX and XML Schema define constraints on individual instances of XML documents. This is a necessary feature for web services. I've learned that there have been a few grammar languages for S-expressions, each used for one particular project or another but never gaining widespread usage. For instance, the EDIF project had a [7] grammar language, but I don't believe it was ever used outside of EDIF.
XPointer allows change-resistant addressing of particular nodes in the parse tree (“infoset”) in URIs.
XQuery allows the expression of sophisticated queries on XML documents.
XSLT allows the declarative, pattern-based transformation of XML documents.
CSS and XSL allow XML to be presented to human readers with formatting.
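
To give a feel for what declarative addressing looks like in practice, here is a small sketch using Python's standard `xml.etree.ElementTree`. That module implements only a limited subset of XPath path expressions, not XPointer, XQuery or XSLT, so treat it as a stand-in for the idea rather than as an example of those specifications.

```python
import xml.etree.ElementTree as ET

DOC = """<document author="paul@prescod.net">
<para>This is a paragraph <footnote>(just a little one)</footnote>.</para>
<para>Ha! I made you say "underwear".</para>
</document>"""

root = ET.fromstring(DOC)

# Path expression: every footnote that is a child of a para under the root.
footnotes = [fn.text for fn in root.findall("./para/footnote")]

# Predicate: only the paras that contain a footnote child.
paras_with_footnote = root.findall("./para[footnote]")

# Attributes are addressed separately from element content.
author = root.get("author")

print(footnotes)
print(len(paras_with_footnote))
print(author)
```

The queries are plain strings describing which nodes are wanted, not programs describing how to walk the tree, which is the separation of code from data that this family of standards is built around.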

7. ~~http://www.edif.org/documentation/BNF_GRAMMAR/estruct0.d~~

You cannot evaluate the hype around XML without incorporating all of these technologies into the evaluation. Cumulatively there are decades of person-effort embodied in those specifications!

Nor is it an accident of history that Lisp programmers never came up with these technologies for Lisp data. The central idea of the XML family of standards is to separate code from data. The central idea of Lisp is that code and data are the same and should be represented the same. The Lisp community's idea of “Schema” would likely be “Lisp program”. The Lisp community's idea of “addressing language” would likely be “Lisp program.” The Lisp community's idea of “query language” would likely be “Lisp program.” Unfortunately this response ignores the [8] Principle of Least Power.

8. http://www.w3.org/DesignIssues/Principles.html#PLP

Nevertheless…

XML is not perfect. For instance, it might be better if the tag names in the end-tag could be omitted. I was a strong proponent of this, but there was a sense that this would make it more difficult to process with simplistic text manipulation tools such as regular expression-based languages, and abuse of the optionality would allow users to re-introduce all of the problems that XML's redundancy is meant to solve (see above). Perhaps entities should start with a character that is even less common than the ampersand (“@”?). The element/attribute distinction is a constant irritant for those who wield Occam's razor. We could continue all day discussing minor changes but many of those decisions were made in the 1980s, as part of SGML's standardization process. I'll agree emphatically that XML is not perfect, just better for its problem domain than s-expressions.

You could also argue that it might not be appropriate to define one language for both document processing and data processing. My own sense is that the benefits of integrating the two domains outweigh the costs of having a technology that is not optimized for the data processing domain.

Notes from Xah Lee

By Xah Lee. Date: 2013-02-11

This article is by Paul Prescod, an XML expert, originally at his website ~~http://www.prescod.net/xml/sexprs.html~~. I got a copy from ~~http://nuclight.ipfw.ru/vadim/ProgLanguageComparison/xml-sexprs.html~~.