What Does It Mean When a Programming Language Claims “Whitespace Is Insignificant”?
Often, in a programming language tutorial or spec, you'll see a claim that whitespace in the language is insignificant. This is often wrong.
What does it mean that “whitespace is insignificant”? I take it to mean one or more of the following:
- Whitespace doesn't matter. It can be omitted in source code without changing the meaning of the code.
- Several types of whitespace are equivalent. For example, newline and space are interchangeable.
- A sequence of whitespace characters is equivalent to a single whitespace character.
Here are some examples in Perl that show all 3 criteria above, but only in the local context around the + operator:
    # -*- coding: utf-8 -*-
    # perl

    # all of the following are equivalent

    print 3+4; # ← normal

    print 3   +   4; # extra white space ok

    print 3
    +
    4; # EOL char is either equivalent to space or can be added
Here is a similar example in Emacs Lisp:
    ;; -*- coding: utf-8 -*-
    ;; emacs lisp

    ;; all of the following are equivalent

    (+ 3 4)

    (+   3   4)

    (+
    3
    4)
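The same three criteria can also be checked mechanically. What follows is a minimal Ruby sketch, not part of the original examples: the helper name `same_meaning?` is made up for illustration, and it only probes the local context around `+`, just as the examples above do.

```ruby
# Hypothetical helper: evaluate each source variant and report whether
# every variant produces the same value.
def same_meaning?(*variants)
  variants.map { |src| eval(src) }.uniq.size == 1
end

base      = "3 + 4"
omitted   = "3+4"         # criterion 1: whitespace omitted
newline   = "3 +\n4"      # criterion 2: a space replaced by a newline
collapsed = "3    +    4" # criterion 3: a run of spaces instead of one

puts same_meaning?(base, omitted, newline, collapsed)
```

This prints true, but it says nothing about other positions in the grammar, which is exactly the gap the rest of this article is about.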
Comment Syntax to End of Line
Almost all popular languages fail a strict reading of “whitespace is insignificant”. First of all, if the language's comment syntax runs to the end of the line (for example, Bash, Perl, Ruby, C, C++, Java, Lisp, Haskell), then it obviously fails, because the EOL character carries meaning: it terminates the comment.
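A small Ruby sketch of this failure, assuming a made-up helper `flatten_newlines` that applies the "newline and space are interchangeable" criterion to a whole source text:

```ruby
# Hypothetical helper: replace every newline in the source with a space.
def flatten_newlines(src)
  src.tr("\n", " ")
end

# Two lines: an assignment followed by a line comment, then an expression.
SOURCE = "x = 1 # set x\nx + 1"

p eval(SOURCE)                   # 2
p eval(flatten_newlines(SOURCE)) # 1
```

After flattening, the second line has become part of the comment, so the program's result changes from 2 to 1.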
Whitespace Inside Strings
If the language has a string datatype in which an embedded newline is significant, then it fails. Basically, you can't simply replace every newline with a space in the source code and expect the program to behave the same.
All popular languages have strings in which whitespace is meaningful. However, we could set this criterion aside as a design decision, because otherwise you couldn't have literal text in your source code, and that would be a major inconvenience. (Whether one could have a language where literal text is not allowed yet is still convenient to read and type, perhaps by automatic preprocessing in an editor that reformats or displays the code on the fly (for example, Mathematica), is an open question to me. To be researched.)
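As an illustration in Ruby (the constant names are made up for this sketch), here is the source text of a string literal before and after its newline is replaced by a space:

```ruby
# Source text of a string literal whose body contains a real newline.
LITERAL = %q{"line one
line two"}

# The same source with every newline replaced by a space.
FLATTENED = LITERAL.tr("\n", " ")

p eval(LITERAL).lines.size         # 2
p eval(FLATTENED).lines.size       # 1
p eval(LITERAL) == eval(FLATTENED) # false
```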
Which Language is Actually “Whitespace Insignificant”?
Now, if we ignore whitespace inside strings, and ignore the comment-to-end-of-line syntax, then many popular languages might qualify as “whitespace insignificant”, but the term is still not well-defined for them.
Can we come up with a mathematically precise definition of “whitespace insignificance” for popular languages such as C and Java? What would the definition look like? Would it be just a few sentences, or tens of special cases?
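One possible starting point, sketched here for Ruby because its standard library exposes its lexer through `Ripper`, is to call two source texts whitespace-equivalent when their token streams match after dropping the tokens the lexer itself reports as spaces or ignored newlines. This is an assumption of the sketch, not an established definition. The names `IGNORABLE`, `significant_tokens`, and `whitespace_equivalent?` are made up, and comments and strings get no special treatment.

```ruby
require "ripper"

# Token types the Ruby lexer reports for plain spaces and for
# newlines it decided to ignore.
IGNORABLE = [:on_sp, :on_ignored_nl].freeze

# Hypothetical helper: token texts that remain after dropping ignorable tokens.
def significant_tokens(src)
  Ripper.lex(src)
        .reject { |_pos, type, _text| IGNORABLE.include?(type) }
        .map    { |_pos, _type, text| text }
end

# Hypothetical definition: same significant tokens, in the same order.
def whitespace_equivalent?(a, b)
  significant_tokens(a) == significant_tokens(b)
end

p whitespace_equivalent?("3+4", "3 + 4")    # true
p whitespace_equivalent?("3 + 4", "3 +\n4") # true
p whitespace_equivalent?("3 + 4", "3\n+ 4") # false
```

The last pair differs because the lexer treats a newline after a complete expression as a real token, while a newline after a binary operator is ignored. A definition along these lines would have to list every such case, which hints that the answer is closer to "tens of special cases" than "a few sentences".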
Ruby Example
Here is a Ruby example.
Following is a syntax error:
    # -*- coding: utf-8 -*-
    # ruby 1.9

    aa = [1,2,3]

    aa.each
    { |xx|
      p xx
    }
But if you remove a newline, it's ok:
    # -*- coding: utf-8 -*-
    # ruby 1.9

    aa = [1,2,3]

    aa.each { |xx|
      p xx
    }

This means that a newline has significant meaning in Ruby, even without considering newlines inside strings or the to-end-of-line comment syntax.
Here is the technical excuse. (Thanks to Bjoern Paschen.)
“If no block is given, an enumerator is returned instead.” http://www.ruby-doc.org/core-1.9.3/Array.html#method-i-each
The Array.each method is fine by itself: it returns an Enumerator, which is then not assigned but thrown away in the above code.
The code block then stands by itself and does not make sense. You can put a backslash at the end of the Array.each line to escape the newline.
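The three variants (newline before the block, no newline, and backslash continuation) can be compared with a small syntax check. This is a sketch: the helper name `parses?` is made up, and wrapping the source in a proc body is just one way to make `eval` parse the code without running it.

```ruby
# Hypothetical helper: true if src is syntactically valid Ruby.
# The source is wrapped in a proc body, so eval parses it but never runs it.
def parses?(src)
  eval("proc {\n#{src}\n}")
  true
rescue SyntaxError
  false
end

BROKEN    = "aa = [1,2,3]\naa.each\n{ |xx| p xx }"     # newline before the block
JOINED    = "aa = [1,2,3]\naa.each { |xx| p xx }"      # block on the same line
CONTINUED = "aa = [1,2,3]\naa.each \\\n{ |xx| p xx }"  # backslash before the newline

p parses?(BROKEN)    # false
p parses?(JOINED)    # true
p parses?(CONTINUED) # true
```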
David Flanagan's Ruby Whitespace Examples
Addendum: David Flanagan's book The Ruby Programming Language gives many detailed examples of Ruby's whitespace dependency.
    # -*- coding: utf-8 -*-
    # whitespace issues in Ruby

    aa = 3 +
    4

    bb = 3
    + 4

    p aa # 7
    p bb # 3

    # -*- coding: utf-8 -*-
    # whitespace issues in Ruby

    def f(x) x*2 end

    p f(3+2)+1 # 11
    p f (3+2)+1 # 12
Why is This Important?
The whitespace significance issue matters for the simplicity of a language's syntax and grammar. It's related to the idea that each character, class of characters, or character sequence has one and only one meaning, regardless of its neighboring characters (that is, not dependent on context; this is different from the concept of a “context-free language”). Lisp and Mathematica come close.
If whitespace insignificance in a language is precisely defined as one or more of the criteria above, then the lexical grammar is simple. With such simplicity, you can then have syntactic layers on top of it, or in the editor, that display or reformat the code in a number of ways on the fly (for example, HTML, Mathematica).
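To illustrate how little machinery such a layer needs when whitespace only separates tokens, here is a minimal Ruby sketch of a re-formatter for a Lisp-like notation. The function names are made up, and the sketch deliberately ignores strings and comments, as the discussion above does.

```ruby
# Split a Lisp-like source into tokens: each paren is a token, and so is
# each run of characters that are neither whitespace nor parens.
def tokens(src)
  src.scan(/[()]|[^\s()]+/)
end

# Re-emit the tokens in one canonical layout: single spaces between
# tokens, no space just inside a paren.
def canonical(src)
  tokens(src).join(" ").gsub("( ", "(").gsub(" )", ")")
end

puts canonical("(+ 3 4)")
puts canonical("(+   3   4)")
puts canonical("(+\n3\n4)")
```

All three calls print `(+ 3 4)`. Because the layout carries no meaning here, any other layout (indented, one token per line, rendered as HTML) could be produced from the same token list.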
This issue shouldn't be confused with readability or convenience of typing the code. They should be in a different layer.
Almost all languages ignore this. Readability and programmer convenience are mixed into the design of the syntax. The worst examples are unix shell tools and C. At the other extreme, a language such as Python, where whitespace is significant, has a design flaw, because readability is hardcoded into the semantics: a tool cannot freely reformat the code or display it in a different layout without risking a change in meaning.
Research TODO
Survey popular languages and give a precise definition of their “whitespace significance” (ignoring line comments and strings).
Hacker News mention: https://news.ycombinator.com/item?id=5517593