Tired of the standards bodies telling us what to do and then changing their attitude? Tired of the SGML/HTML/XML/XHTML/HTML5 changes? Tire no more: here's a new proposal that will make it all better.
The aim is a very simple syntax: 100% regular, leaner, and trivial to parse in any language.
Here's a sample file in standard XML ATOM webfeed.
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:base="http://xahlee.org/emacs/">
  <title>Xah's Emacs Blog</title>
  <subtitle>Emacs, Emacs, Emacs</subtitle>
  <link rel="self" href="http://xahlee.org/emacs/blog.xml"/>
  <link rel="alternate" href="http://xahlee.org/emacs/blog.html"/>
  <updated>2010-09-19T14:53:08-07:00</updated>
  <author>
    <name>Xah Lee</name>
    <uri>http://xahlee.org/</uri>
  </author>
  <id>http://xahlee.org/emacs/blog.html</id>
  <icon>http://xahlee.org/ics/sum.png</icon>
  <rights>© 2009, 2010 Xah Lee</rights>
  <entry>
    <title>Using Emacs's Abbrev Mode for Abbreviation</title>
    <id>tag:xahlee.org,2010-09-19:215308</id>
    <updated>2010-09-19T14:53:08-07:00</updated>
    <summary>tutorial</summary>
    <link rel="alternate" href="http://xahlee.org/emacs/emacs_abbrev_mode.html"/>
  </entry>
</feed>
Here's what it looks like in HTML6:
〔?xml 「version “1.0” encoding “utf-8”」〕
〔feed 「xmlns “http://www.w3.org/2005/Atom” xml:base “http://xahlee.org/emacs/”」
  〔title Xah's Emacs Blog〕
  〔subtitle Emacs, Emacs, Emacs〕
  〔link 「rel “self” href “http://xahlee.org/emacs/blog.xml”」〕
  〔link 「rel “alternate” href “http://xahlee.org/emacs/blog.html”」〕
  〔updated 2010-09-19T14:53:08-07:00〕
  〔author
    〔name Xah Lee〕
    〔uri http://xahlee.org/〕
  〕
  〔id http://xahlee.org/emacs/blog.html〕
  〔icon http://xahlee.org/ics/sum.png〕
  〔rights © 2009, 2010 Xah Lee〕
  〔entry
    〔title Using Emacs's Abbrev Mode for Abbreviation〕
    〔id tag:xahlee.org,2010-09-19:215308〕
    〔updated 2010-09-19T14:53:08-07:00〕
    〔summary tutorial〕
    〔link 「rel “alternate” href “http://xahlee.org/emacs/emacs_abbrev_mode.html”」〕
  〕
〕
The standard XML markup bracket is simplified using simple brackets in lisp style. For example, this code:
<h1 id="xyz" class="abc">HTML6</h1>
is written as:
〔h1 「id “xyz”, class “abc”」 HTML6〕
The delimiters used are:
|Character||Unicode Code Point||Unicode Name|
|〔||U+3014||LEFT TORTOISE SHELL BRACKET|
|〕||U+3015||RIGHT TORTOISE SHELL BRACKET|
The attributes are specified by corner brackets. Items inside are a sequence of pairs, separated by a comma. The value must be quoted by curly double quotes.
|Character||Unicode Code Point||Unicode Name|
|「||U+300c||LEFT CORNER BRACKET|
|」||U+300d||RIGHT CORNER BRACKET|
|“||U+201c||LEFT DOUBLE QUOTATION MARK|
|”||U+201d||RIGHT DOUBLE QUOTATION MARK|
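As a rough illustration of the node and attribute syntax described above, here is a minimal Python sketch of a serializer. The function name `to_html6` and the `(tag, attrs, children)` tuple layout are assumptions made for this example, not part of the proposal, and escaping of delimiter characters is left out for brevity.

```python
def to_html6(tag, attrs=None, children=()):
    """Serialize one node to HTML6 text.

    tag      -- element name, e.g. "h1"
    attrs    -- dict of attribute name -> value, or None
    children -- sequence of strings (text) or (tag, attrs, children) tuples

    Hypothetical helper for illustration only; no escaping is performed.
    """
    parts = [tag]
    if attrs:
        # Attributes: pairs inside corner brackets, separated by a comma,
        # each value wrapped in curly double quotes.
        pairs = ", ".join(f"{name} “{value}”" for name, value in attrs.items())
        parts.append(f"「{pairs}」")
    for child in children:
        if isinstance(child, str):
            parts.append(child)
        else:
            parts.append(to_html6(*child))
    # The whole node is wrapped in tortoise shell brackets.
    return "〔" + " ".join(parts) + "〕"


print(to_html6("h1", {"id": "xyz", "class": "abc"}, ["HTML6"]))
print(to_html6("author", None, [("name", None, ["Xah Lee"])]))
```

The first call reproduces the `h1` example above; the second shows how nesting falls out of plain recursion.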
To include a literal tortoise shell bracket character in data, use a hexadecimal character reference such as &#x3015; for 〕, and similarly for other Unicode chars.
The only chars you need to escape are 〔tortoise shell brackets〕, 「corner brackets」, “double curly quotes”.
There are no named entities. For example, &amp; is literal; it should not be rendered as “&”.
Character “entities” are allowed in hexadecimal format, ⁖ &#x3b1; for “α”.
Identical to XML.
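The escaping rules above could be sketched in Python roughly as follows. The names `escape` and `unescape` are hypothetical, and the sketch assumes the XML-style `&#xHHHH;` hexadecimal form for character references. It does not try to protect text that already looks like a hexadecimal reference, so it is a starting point rather than a complete round-trip scheme.

```python
import re

# The six delimiter characters that need escaping in data.
SPECIAL = "〔〕「」“”"

# XML-style hexadecimal character reference, e.g. &#x3015;
HEX_REF = re.compile(r"&#x([0-9a-fA-F]+);")


def escape(text):
    """Replace each delimiter character with a hexadecimal reference."""
    return "".join(f"&#x{ord(ch):x};" if ch in SPECIAL else ch for ch in text)


def unescape(text):
    """Decode hexadecimal references; everything else stays literal."""
    return HEX_REF.sub(lambda m: chr(int(m.group(1), 16)), text)


print(escape("a 〕 b"))
print(unescape("&#x3b1; &amp; &#x3015;"))
```

Because there are no named entities, `&amp;` passes through `unescape` unchanged.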
Source code must be UTF8 only. Nothing else.
File name extension is “.html6”.
The semantics should follow XHTML5.
What's wrong with XHTML/HTML5 exactly?
The politics of standards bodies change, and their attitude about what is correct also changes unpredictably. Around 2000, we were told that XML and XHTML would change society, or, at least, make the web correct and valid, and far easier and more flexible to develop for. Now it's a decade later. Sure, the web has improved, but as far as HTML/XHTML and browser rendering go, it's still syntax soup with extreme complexities. 99.99% of web pages are still not valid, and nobody cares. Google doesn't care. Apple doesn't care. Of Google's hundreds of tips to webmasters, none ever advocates HTML validation. Google Earth itself generates invalid KML. Some 99.9% of the HTML files produced by Google or Apple are not valid HTML. Major browsers still don't agree on their rendering behavior. Web dev is actually far more complex, involving tens or hundreds of technologies that hardly any one person fully knows (ajax, JSON, lots of XML variations). It's hard to say if it is better at all than the HTML3 days with “font” and “table” tags and a gazillion tricks. The best practical approach is still trial ＆ error with browsers.
And now HTML5 comes along, from a newfangled hip group primarily from the current big corporations Google and Apple, with an attitude that validation is overrated. This is an insult to the XML mantra from w3c, just when more and more sites were starting to use correct XHTML and Microsoft's Internet Explorer was getting on track about correctness.
For some personal story about how the change of standards body attitude affects practical programming, see:
Why not just adopt SXML from the lisp world?
Lisp's SXML is not a stand-alone syntax designed for the needs of the web. SXML's syntax is designed to be compatible with the lisp language's existing syntax. Lisp syntax (aka sexp) has several syntactical irregularities. It is not 100% nested parens of the form (a b c …). SXML is easy for lispers to adopt, but harder for other languages and communities. (For details of lisp's syntax irregularities, see: Fundamental Problems of Lisp.)
The following explains how several of lisp's syntaxes for XML break the tree-and-syntax structural correspondence that is inherent in XML.
XML as a textual representation of a tree has a quirk, in that each node has this special thing called “attributes” (aka “properties”). The “attribute” is not a node of the tree, but rather special info attached to a node. Here's an example in HTML:
<h1 id="xyz" class="abc">A B C</h1>
The standard lisp syntax to represent attributes, adopted from lisp's similar concept of “properties” of lisp's “symbols”, is this:
(h1 :id "xyz" :class "abc" A B C)
The way this works is by creating an extra rule on the first char of a name. If the name starts with :, then that name is considered the name of a property, and the next element is considered its value. This special rule breaks a fundamental principle of XML syntax. That is, the lexical structure of the source code no longer corresponds to the semantic structure. The semantics of the source code changes depending on the first char of an atom.
Another way to represent XML's attributes, adopted in some lisp code and based on lisp's “alist” (association list) syntax, is this:
(h1 ((id . "xyz") (class . "abc")) A B C)
This, too, has syntactical ambiguity.
From purely lexical analysis, the whole ((id . "xyz") (class . "abc")) can be interpreted as a node by itself, where the first element is again a node.
It also uses lisp's special “cons” syntax (id . "xyz"), which is itself ambiguous at the syntax level. It can be considered as a node named “id” with 2 branches, . and "xyz", like this:
id . "xyz"
or it can be considered as a node named “cons” with 2 branches, id and "xyz", like this:
cons id "xyz"
Another common lisp syntax for attributes, from SXML, is this:
(h1 (@ (id . "xyz") (class . "abc")) A B C)
Here, again a special rule is created. When the first element's name is just “@”, then that parenthesized expression is considered to be a property list, not a node.
So, in conceiving HTML6, a solution for getting rid of syntax ambiguity for node vs attributes is to use a special bracket for properties/attributes of a node. ⁖ 〔h1 「id “xyz”, class “abc”」 A B C〕. This is a purely syntactical solution.
Why use weird Unicode characters for brackets?
Unicode is widely adopted today and is very practical. 〔☛ Unicode Popularity: How Popular is UTF-8?〕 It is the default char set for many langs (⁖ Java, XML, Haskell, GoLang). Unicode also has a lot of proper matching pairs. 〔☛ Matching Brackets in Unicode〕 Today is a good time to adopt the wide range of proper symbols provided in Unicode, instead of relying on the very limited number of ASCII characters of the 1960s.
The straight double quote character " (ASCII 34) is not a matching pair; it has several practical problems when used in a computer language. For example, it needs context to know which quote chars are paired. Also, it is difficult to recover from a missing quote. (This problem is especially pronounced in text editors for syntax highlighting.) A proper matching pair allows programs and editors to determine the quoted string correctly and more easily, and thus to know its position in a tree, which makes it easier to implement features such as navigating the tree in an editor. (For more detail, see: Problems of Symbol Congestion in Computer Languages; ASCII Jam vs Unicode.)
The problem of inputting special chars of Unicode can be trivially solved by text editors. For example, Emacs, Mathematica, Microsoft Word, all have simple and efficient ways to enter commonly used special chars such as « ™ © é ¶ “” π ». (See: Emacs Math Symbols Input Mode (xmsi-mode) ◇ How Mathematica does Unicode? ◇ How to Create a APL or Math Symbols Keyboard Layout.)
If we used ASCII brackets such as () for HTML6, a lot of ugly escaping would be needed in the content text.
The core idea of HTML6 is that the syntax is designed specifically as a 2-dimensional textual representation of a tree, with an added special syntax for XML's concept of attributes.
The advantage of this is that it should be extremely easy to parse. The syntax can be specified in perhaps just 3 lines of parsing expression grammar (PEG), and PEG libraries exist for Perl, Python, Ruby, Lua, C, C#, Java, OCaml/F#, Clojure, … A parser for HTML6 can also be trivially written without relying on PEG.
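To give a feel for how small such a parser can be, here is a hand-written recursive-descent sketch in Python that does not use PEG. It is only an illustration under assumptions: the function names, the `(tag, attrs, children)` result shape, the whitespace handling (text is trimmed), and the choice to accept attribute pairs with or without a separating comma are all choices made for this example, not rules stated in the proposal.

```python
import re

HEX_REF = re.compile(r"&#x([0-9a-fA-F]+);")          # e.g. &#x3015;
NAME = re.compile(r"[^\s〔〕「」]+")                    # element name
ATTR_PAIR = re.compile(r"([^\s,“”]+)\s*“([^“”]*)”")  # name “value”


def decode(text):
    """Decode hexadecimal character references."""
    return HEX_REF.sub(lambda m: chr(int(m.group(1), 16)), text)


def parse_attrs(src, i):
    """Parse 「name “value”, ...」 at src[i]; return (dict, next index)."""
    end = src.index("」", i)
    pairs = ATTR_PAIR.findall(src[i + 1:end])
    return {name: decode(value) for name, value in pairs}, end + 1


def parse_node(src, i=0):
    """Parse one 〔...〕 node at src[i]; return ((tag, attrs, children), next index)."""
    if src[i] != "〔":
        raise ValueError(f"expected 〔 at position {i}")
    m = NAME.match(src, i + 1)
    if not m:
        raise ValueError(f"missing element name at position {i + 1}")
    tag, i = m.group(), m.end()
    while i < len(src) and src[i].isspace():
        i += 1
    attrs = {}
    if i < len(src) and src[i] == "「":
        attrs, i = parse_attrs(src, i)
    children = []
    while i < len(src):
        if src[i] == "〕":                 # end of this node
            return (tag, attrs, children), i + 1
        if src[i] == "〔":                 # nested child node
            child, i = parse_node(src, i)
            children.append(child)
        else:                              # run of text up to the next bracket
            j = i
            while j < len(src) and src[j] not in "〔〕":
                j += 1
            text = src[i:j].strip()
            if text:
                children.append(decode(text))
            i = j
    raise ValueError("missing closing 〕")


node, _ = parse_node("〔h1 「id “xyz”, class “abc”」 HTML6〕")
print(node)
```

Because every delimiter is a distinct character with a single role, the parser never needs lookahead beyond one character to decide what it is reading.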
Any thoughts about flaws?
It is probably hopeless for browsers to adopt this. But if you are involved in the standards bodies of XML or HTML5, please consider this, and consider more about correctness and validation. XML is a move in the right direction, with huge consequences in various XML languages and formats (XSLT, XSL, XQuery, o:XML…, Microsoft Office Open XML, etc.). Whatever new features HTML5 has can be expressed as XML with a new DTD (⁖ XHTML 5). HTML5 was created in part to address w3c's slowness in responding to industrial changes, and in part to address the verbosity of XML syntax. HTML5 by itself does not introduce any new technical concepts. The force behind HTML5 is almost purely corporate adoption, and mostly existing practices from corporations. But the attitude it brought about seems to be a step backward, towards corporate-sponsored tags (much from Google) and technologies (⁖ much of canvas is from Apple, a low-level pixel-drawing garbage in comparison to SVG), odd-end special tags, more special syntaxes, less focus on correctness, and another new syntax/format in the HTML/XML/XHTML/DTD-sniffing soup.
See also: Emacs Lisp: html6-mode.