What Does it Mean When a Programming Language Claims “Whitespace is Insignificant”?

By Xah Lee.

Often, in a programming language tutorial or spec, you'll see a claim that whitespace in the language is insignificant. This is often wrong.

What does it mean that “whitespace is insignificant”? I take it to mean one or more of the following:

Here are some examples in Perl, showing all 3 criteria above, but only in the local context around +:

# -*- coding: utf-8 -*-
# perl

# all of the following are equivalent

print 3+4;         # ← normal

print 3 +   4;       # extra white space ok

print 3 +
4; # EOL char is either equivalent to space or can be added

Here is a similar example in emacs lisp:

;; -*- coding: utf-8 -*-
;; emacs lisp

;; all of the following are equivalent
(+ 3 4)

   (+ 3    4)

   (+ 3
      4)

Comment Syntax to End of Line

Almost all popular languages fail a strict reading of “whitespace is insignificant”. First of all, if the language's comment syntax runs to the end of the line (for example, Bash, Perl, Ruby, C, C++, Java, Lisp, Haskell, etc), then it obviously fails, because the EOL char has significant meaning: it ends the comment.
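Here is a minimal Ruby sketch of the point about line comments. The helper name flatten_newlines is made up for this illustration; it just replaces every newline with a space, and the tiny program being evaluated is also invented for the demo.

```ruby
# Hypothetical helper (not from any library): replace every newline with a space.
def flatten_newlines(source)
  source.gsub("\n", " ")
end

# A three-line program whose first line carries a to-end-of-line comment.
source = "x = 1 # set x\nx = x + 1\nx"

# With the newlines intact, the comment stops at the first EOL.
p eval(source)                    # 2

# After flattening, everything after the "#" is one long comment,
# so only "x = 1" is left as code.
p eval(flatten_newlines(source))  # 1
```

The two results differ only because one EOL char was turned into a space.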

Whitespace Inside Strings

If the language has a string datatype where a newline inside is significant, then it fails. Basically, you can't simply replace every newline by a space in the source code and expect the program to behave the same.
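A small Ruby sketch of the same idea for strings. The helper literal_with is a made-up name for this illustration: it evaluates a one-token program, a single-quoted string literal, with a chosen whitespace character in the middle.

```ruby
# Hypothetical helper: evaluate a program that is just one string literal,
# with `separator` placed between the two words inside the quotes.
def literal_with(separator)
  eval("'first#{separator}second'")
end

with_newline = literal_with("\n")
with_space   = literal_with(" ")

p with_newline == with_space   # false
p with_newline.lines.length    # 2
p with_space.lines.length      # 1
```

The two source texts differ by a single whitespace character, yet they denote different values.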

All popular languages have strings where whitespace inside is meaningful. However, we could set this criterion aside as a design decision, because otherwise you couldn't have literal text in your source code, and that would be a major inconvenience. (Whether one could have a language where literal text is not allowed yet is still convenient to read and type, perhaps by automatic preprocessing in the editor that reformats or displays on the fly (for example, Mathematica), is an open question to me. To be researched.)

Which Language is Actually “Whitespace Insignificant”?

Now, if we ignore whitespace inside strings, and ignore the comment-to-end-of-line syntax, then many popular languages might qualify as “whitespace insignificant”, but the notion is still not well-defined.

Can we come up with a mathematically precise definition of “whitespace insignificance” for popular langs such as C and Java? What would the definition be like? Would it be just a few sentences, or tens of special cases?

Ruby Example

Here is a Ruby example.

The following is a syntax error:

# -*- coding: utf-8 -*-
# ruby 1.9

aa = [1,2,3]
aa.each
{ |xx|
  p xx
}

But if you remove a newline, it's ok:

# -*- coding: utf-8 -*-
# ruby 1.9

aa = [1,2,3]
aa.each { |xx|
  p xx
}

This means that a newline has significant meaning in Ruby. (Not even considering newlines inside strings or the to-end-of-line comment syntax.)

Here is the technical excuse. (Thanks to Bjoern Paschen.)

“If no block is given, an enumerator is returned instead.” http://www.ruby-doc.org/core-1.9.3/Array.html#method-i-each

The Array.each method is fine by itself: it returns an Enumerator, which is then not assigned but thrown away in the above code.

The code block then stands by itself and does not make sense. You can use a backslash at the end of the Array.each line to escape the newline.
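The three layouts can be compared with a short Ruby sketch. The helper parses? is a made-up name for this illustration; it wraps the source in a lambda body so that eval only compiles the code and does not run it.

```ruby
# Hypothetical helper: true if `source` parses as Ruby, false on a syntax error.
# The lambda wrapper means the code is compiled but never executed.
def parses?(source)
  eval("lambda {\n#{source}\n}")
  true
rescue SyntaxError
  false
end

newline_before_block = "[1, 2, 3].each\n{ |xx| p xx }"
same_line            = "[1, 2, 3].each { |xx| p xx }"
backslash_newline    = "[1, 2, 3].each \\\n{ |xx| p xx }"

p parses?(newline_before_block)  # false
p parses?(same_line)             # true
p parses?(backslash_newline)     # true
```

So the backslash makes the newline behave like a plain space, which is exactly what the bare newline fails to do.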

David Flanagan's Ruby Whitespace Examples

Addendum: David Flanagan's book The Ruby Programming Language gives many detailed examples of Ruby's whitespace dependency.

# -*- coding: utf-8 -*-
# whitespace issues in Ruby

aa = 3 +
4

bb = 3
+ 4

p aa # 7
p bb # 3

# -*- coding: utf-8 -*-
# whitespace issues in Ruby

def f(x) x*2 end

p f(3+2)+1    # 11
p f (3+2)+1   # 12
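The first of these examples can also be looked at through Ruby's own lexer. This is a sketch, assuming the Ripper library that ships with standard Ruby and assuming its token names :on_nl (a newline that ends a statement) and :on_ignored_nl (a newline the lexer skips). The helper newline_kinds is a made-up name for this illustration.

```ruby
require "ripper"

# Hypothetical helper: list how the lexer classifies each newline in `source`.
# Each element of Ripper.lex is [[line, column], token_type, text, ...];
# only the token type at index 1 is used here.
def newline_kinds(source)
  Ripper.lex(source)
        .map { |token| token[1] }
        .select { |type| type == :on_nl || type == :on_ignored_nl }
end

p newline_kinds("aa = 3 +\n4")   # newline after a trailing operator
p newline_kinds("bb = 3\n+ 4")   # newline before a leading operator
```

If the two calls print different token types, the lexer itself is telling you that the same EOL char gets two different meanings depending on its neighbors.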

Why is This Important?

The question of whitespace significance matters for the simplicity of a language's syntax and grammar. It's related to the concept that each character, class of characters, or character sequence has one and only one meaning, regardless of its neighboring characters (that is, not dependent on context; this is different from the concept of a “context-free language”). Lisp and Mathematica come close.

If whitespace insignificance in a language is precisely defined as one or more of the criteria above, then it means the lexical grammar is simple. With such simplicity, you can then have syntactic layers on top of it, or in the editor, that display or reformat the code in a number of ways on the fly (for example, HTML, Mathematica).
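To make “the lexical grammar is simple” concrete, here is a toy lexer in Ruby for a Lisp-like syntax. It is only a sketch under a strong assumption: the only tokens are parentheses and runs of non-whitespace characters (no strings, no comments). The name toy_tokens is made up for this illustration.

```ruby
# Toy lexer: a paren is a token by itself, any other run of
# non-whitespace, non-paren characters is a token, and whitespace
# of any kind or length only separates tokens.
def toy_tokens(source)
  source.scan(/[()]|[^\s()]+/)
end

layouts = [
  "(+ 3 4)",
  "   (+ 3    4)",
  "   (+ 3\n      4)"
]

p toy_tokens(layouts.first)                             # ["(", "+", "3", "4", ")"]
p layouts.map { |src| toy_tokens(src) }.uniq.length     # 1
```

With a lexer this simple, an editor or formatter is free to lay the same token list out in any way it likes.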

This issue shouldn't be confused with readability or convenience of typing the code. They should be in a different layer.

Almost all languages ignore this. Readability and programmer convenience are mixed into the design of the syntax. The worst examples are unix shell tools and C. At the other extreme, a language such as Python, where whitespace is significant, has a design flaw, because that means the readability is hardcoded into the semantics. Such code can never be freely reflowed by an auto-formatter or displayed in a different way.

Research TODO

Survey popular languages and give a precise definition of their “whitespace significance” (ignoring line comments and strings).

Hacker News mention: https://news.ycombinator.com/item?id=5517593