Syntax Design Problem: Irregularity vs Convenience

By Xah Lee. Date: . Last updated: .

one of the idiocy of HTML spec is that the “pre” tag discards the first blank line.

for example, if you have:

<pre style="border:solid thin red">
x = 3
</pre>

Here is how your browser renders it:

x = 3

The first blank line is ignored. However, only the FIRST blank line is ignored. If you have 2 blank lines in the beginning, it'll be rendered with 1 blank line.

<pre style="border:solid thin red">

x = 3
</pre>

x = 3

We do this, because, it's convenient for coder. Because, we like to see the pre content aligned to the left in raw HTML.

For example, you rather write it this way:

<pre>
1
2
3
</pre>

than

<pre>1
2
3</pre>

this is a idiocy because sacrifice syntax simplicity with coder convenience.

According to html spec, line break immediately following a start tag is to be ignored.

html line breaks in tag 2021-12-04
html line breaks in tag 2021-12-04 [http://www.w3.org/TR/html4/appendix/notes.html#notes-line-breaks]

SGML (see [ISO8879], section 7.6.1) specifies that a line break immediately following a start tag must be ignored, as must a line break immediately before an end tag. This applies to all HTML elements without exception.

The problem comes, when you have programs that deal with code. That's why, in programing, computing tech, there are one hundred exceptions, irregularities, and thus bugs, headaches. The worst offender is unix shell syntax. [see Unix Shell Syntax Irregularities Galore]

At first, syntax conveniences like these are nice. The rules are lax, and you use it without problems. But then, once the language grew, and you deal with many languages, you find everywhere there's exceptions, special rules, and you can't remember what rule they thought were convenient at the time, and there is no simple systematic rule about them. Each one becomes a ad hoc syntax soup of hell.

C = Syntax Soup

Almost all languages abuse syntax sugar for convenience, to various degrees. C language syntax is worst. It is basically of no design. Most of the syntax “design” is based on user's typing convenience at the time. [see Why I Hate the C Language]

LISP = Irregular Monster Hidden

Even lisp, didn't escape this problem. Contrary to popular belief, there are quite a few irregularities in lisp syntax. [see Fundamental Problems of Lisp]

XML Syntax is Regular??

Even XML, whose syntax is more regular than lisp, cannot escape irregularities.

Here is a sample valid XML, can you spot the syntax irregularity?

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:base="http://xahlee.info/comp/">

  <title>…</title>
  <subtitle>…</subtitle>
  <link rel="self" href="blog.xml"/>
  <link rel="alternate" href="blog.html"/>
  <updated>…</updated>
  <author>
    <name>…</name>
    <uri>…</uri>
  </author>
  <id>…</id>
  <icon>…</icon>
  <rights>…</rights>

  <entry>
    <title>…</title>
    <id>…</id>
    <updated>…</updated>
    <summary>…</summary>
    <content type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml">
        <p><a href="…">…</a></p>
        <p>…</p>
        <p>…</p>
      </div>
    </content>
    <link rel="alternate" href="…"/>
  </entry>

</feed>

Sin: Omitting Ending Tags

Another major problem of HTML irregularity is omitted ending tags. HTML, due to its SGML baggage, has what's called self-closing tags. When XML was hot, it fixed the problem by requiring all such tags with a special syntax of a slash at the end. For example:

<img src="cat.jpg" alt="cat" />

but the maverick HTML5 started by commercial Apple, Mozilla, Opera, twarted all this, and in fact wanted to kill XML and is succeeding.

Big offender is Google, telling users to omit ending tags in their HTML5 style guide. The consequence is that people will omit ending tags that are not allowed to be omitted, and we are back to syntax-soup quirk-mode hell. See:

How to Solve the Syntax Sugar Problem?

This problem should be solved by clear separation of issues. For example, XML takes the regularity approach, and you can have editors that represent the data to the user in a most easy-to-read format, or structural editors.

Another approach is Mathematica, where you have a systematic syntax layer. So, at the bottom layer, it's purely nested like XML and LISP, but without irregularities, and another layer on top, that supports all the syntax warts we human have got used to, as in traditional math notation and infix notation. Yet, there's a simple, regular, systematic, transformation rules that can change these two layers easily.

Instead of syntax sugar, you should have a 100% regular syntax, or a layer with systematic rule, and let editor deal with it, and present code to user in a different layer.