Programming: Tab vs Space in Source Code
In coding a computer program, there is often a choice between tabs and spaces for code indentation. There is a large amount of confusion about which is better. It has become what's known as a “religious war”, a heated fight over trivia. In this essay, I would like to explain the situation behind it, and which is proper.
Simply put, tabs are proper, and spaces are improper. Why? The answer may seem ridiculously simple given the scale of the commotion: the semantics of the tab character is what indenting is about, while using spaces to align code is a hack.
Now, tech geekers may object to this simple conclusion because they itch to drivel about different editors and so on. The alleged problems created by tabs, as seen by industry coders, are caused by two things: ① tech geekers' sloppiness and lack of critical thinking, which lead them to not understand the functions of the tab and space characters. ② Due to the first reason, they have created and propagated a massive non-understanding and misuse, to the degree that many tools (for example, vi) did not deal with tabs well (in the beginning) and using spaces to align code has become widely practiced, so that in the end spaces seem to be actually better by popularity and crass simplicity.
In short, this is a phenomenon of misunderstanding begetting a snowball of misunderstanding, such that it created a cultural milieu that embraces this malpractice and kicks out what is true or proper. Situations like this happen a lot in unix. One non-unix example is the file name suffix known as the “extension”, where the info about a file's type became part of the file name (for example, “.txt”, “.html”, “.jpg”). Another well-known example is HTML practice in the industry, where badly designed tags born of corporations' competitive greed, and stupid coding and misunderstanding by coders and their tools, are so widespread that they force the correct way to the side: the eventual standardization follows the sheer quantity of improper but deep-seated practices.
Now, tech geekers may still object that using tabs requires editors to set the tab positions, and plain files don't carry that information. This is a good point, and the solution is to advance the sciences such that your source code in some way embeds such information. This would be progress. However, it is never thought of by unix coders because the Unix Philosophy has already conditioned people to hack and be shallow. In this case, many will simply use the character intended to separate words for the purpose of indentation or alignment, and spread the practice with militant drivel.
Now, given the situation of tabs vs spaces already messed up by the unixers and the unix brain-washing of coders in the industry… Which should we use today? I do not have a good proposition, other than to use whichever works for you, but put more critical thinking into things to prevent mishaps like this.
Tabs vs Spaces can be thought of as parameters vs hard-coded values, or HTML vs ASCII formatting, or XML/CSS vs HTML 4, or structural vs visual, or semantic vs format. In these, it is always easy to convert from the former to the latter, but nearly impossible to convert from the latter to the former. That is because the former encodes information that is lost in the latter. If we look at the issue of tabs vs spaces, indeed, it is easy to convert tabs to spaces in source code, but more difficult to convert from spaces to tabs, because tabs as indentation actually contain the semantic information about indentation. With spaces, this critical information is lost in space.
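The easy direction can be sketched in a few lines of Python. This is only an illustration, not a tool: the names `SOURCE` and `render` are made up for this example, and it assumes that every leading tab stands for exactly one level of indentation.

```python
# Tab-indented source: each leading tab means "one level of indentation".
SOURCE = "def f(x):\n\tif x:\n\t\treturn 1\n\treturn 0\n"

def render(source: str, width: int) -> str:
    """Render tab-indented source with `width` spaces per indentation level."""
    lines = []
    for line in source.split("\n"):
        body = line.lstrip("\t")          # text after the leading tabs
        depth = len(line) - len(body)     # number of leading tabs
        lines.append(" " * (width * depth) + body)
    return "\n".join(lines)

two = render(SOURCE, 2)    # the same file, shown with 2-space levels
four = render(SOURCE, 4)   # the same file, shown with 4-space levels
```

The depth is stored once, in the tabs, and any width can be produced from it on demand. Going the other way, from `two` or `four` back to tabs, requires knowing which width was used, and the spaces themselves do not say.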
This issue is intimately related to another issue in source code: soft-wrapped lines versus physical, hard-wrapped lines by EOL (end of line character). This issue has far more consequences than tabs vs spaces, and the unixers' unthinking has done far-reaching damage in the computing industry. Unix's EOL way of thinking has created languages based on EOL (just about ALL languages except the Lisp family and Mathematica), tools based on EOL (cvs, diff, grep, and basically every tool in unix), and thoughts based on EOL (software value estimation by counting EOL, the hard-coded email quoting system by “>” prefix, and silent line-truncations in many unix tools), such that any progress or development towards an “algorithmic code unit” concept or language syntax is suppressed. Some of these issues are discussed in the essay The Harm of Hard-wrapping Lines.
What do you mean by embedding tab position info into the source code? How's that gonna be done?
Tech geekers may not realize it, but such embedding of meta info does exist in many technologies, by various means, because of a need. For example: Mac OS Classic's resource fork and Mac OS X's bundling system, the unix shell script's shebang line, the file encoding declaration -*- coding: utf-8 -*- (originated from Emacs), CVS's change-log insertion, Mathematica's source code system the Notebook, Microsoft Word's transparent meta data, as well as HTML and XML's various declarations embedded in the file (for example, <meta http-equiv="content-language" content="zh">). Some of these systems are good designs and some are hacks.
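As one possible sketch of how tab position info could ride along in the file itself, here is a small Python reader for a declaration line in the same -*- … -*- style as the coding declaration above. The function name `read_file_variables` and the sample header are made up for illustration, `tab-width` is used here as an example setting name, and the parser is deliberately minimal: it only looks at the first two lines and only understands `key: value` pairs separated by semicolons.

```python
import re

# Matches a "-*- key: value; key: value -*-" declaration inside a line.
FILE_VARS = re.compile(r"-\*-(.*?)-\*-")

def read_file_variables(text: str) -> dict[str, str]:
    """Return key/value pairs declared on the first two lines of `text`."""
    for line in text.splitlines()[:2]:
        match = FILE_VARS.search(line)
        if match:
            pairs = {}
            for item in match.group(1).split(";"):
                key, sep, value = item.partition(":")
                if sep:  # skip fragments that are not "key: value"
                    pairs[key.strip()] = value.strip()
            return pairs
    return {}  # no declaration found

# Hypothetical file header carrying both an encoding and a tab width.
header = "# -*- coding: utf-8; tab-width: 4 -*-\nprint('hi')\n"
settings = read_file_variables(header)
```

An editor that honors such a line can display each tab at the declared width, so the file stays tab-indented while every reader sees the layout the author intended.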
Somehow tech geekers have the sense that “source code” must be a plain text file containing nothing else but the programming code. This may be a defensible position, but as we can see in the above examples, this idea is primitive and does not address the various needs. If the tech geekers had thought through these issues, computing languages and their source code might have developed into more powerful and flexible integrated systems, like the standardized examples above. For instance, many commercial development systems actually already have such meta-data embodied with the source code (for example, Wolfram Research's Mathematica and possibly Borland Delphi, Metrowerks's CodeWarrior, Microsoft Visual Studio). Some of these not only embody development-related info such as debug points or linking files, but also allow programmers to highlight code for visual purposes like a word processor, or even display it visually as typeset mathematics.
Converting spaces to tabs is actually easy. I don't see how spaces lose info.
Here is an illustration of how it is theoretically not possible to correctly convert spaces to tabs. Suppose you are writing in a language where the indentation is part of the semantics, not just for appearance. Now, suppose you have two lines, the first with a 2-space prefix and the second with a 4-space prefix. If you convert this to tabs, how do you know whether that's 1 and 2 tabs, or 2 and 4 tabs? In essence, there is no way to tell how many tabs n represents, where n is the smallest space prefix in the code, unless n == 1.
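The ambiguity can be made concrete with a short Python sketch. The function name `possible_readings` is made up for this example, and it assumes the only constraint is that every prefix must be a whole multiple of the indent unit.

```python
from functools import reduce
from math import gcd

def possible_readings(prefixes: list[int]) -> dict[int, list[int]]:
    """Map each candidate indent unit to the tab depths it would imply."""
    common = reduce(gcd, prefixes)  # largest unit that divides every prefix
    return {
        unit: [p // unit for p in prefixes]
        for unit in range(1, common + 1)
        if common % unit == 0       # any divisor of `common` is also valid
    }

# Two lines with 2-space and 4-space prefixes.
readings = possible_readings([2, 4])
# readings == {1: [2, 4], 2: [1, 2]}
```

Both readings are consistent with the spaces in the file, and nothing in the file says which one the author meant. A tab-indented file has exactly one reading.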
The above demonstrates the information loss in using spaces for indentation from a theoretical perspective. There are also practical problems. In practice, many languages allow string literals like seperator=" \n \n ", and strings can easily have a run of spaces. One cannot simply run a blind find-and-replace operation to replace all spaces with tabs. Also, many unix languages have a construct called “heredoc” as a means to embed a literal block of text. For example, here's the heredoc construct of PHP (Pretty Home Page):
$novelText = <<<arbitraryCharsHereAsDelimiter
         (__)
         (oo)
  /-------\/
 / |     ||
*  ||----||
   ~~    ~~
arbitraryCharsHereAsDelimiter;
Regardless of the merit of this design as a language construct, the purpose of “heredoc” is that it allows programmers to easily embed a text (a large string), without worrying about the text containing sequences of characters that may be meaningful to the language. If a language has a heredoc construct, then it is basically impossible to convert from spaces to tabs, as that will botch the literal strings embedded in heredocs. However, it is less of a problem to convert tabs to spaces, because spaces appear in literal strings far more often than literal tabs do.
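A small Python sketch of both failure modes follows. The converter names and sample strings are made up for illustration, and the heredoc sample is just text held in a Python string, not executed code.

```python
SOURCE = 'if ready:\n    separator = "a    b"\n'

def blind_convert(source: str, width: int = 4) -> str:
    """Naive find-and-replace: every run of `width` spaces becomes a tab."""
    return source.replace(" " * width, "\t")

def leading_only_convert(source: str, width: int = 4) -> str:
    """Convert only each line's leading spaces; leftover spaces are kept."""
    out = []
    for line in source.split("\n"):
        body = line.lstrip(" ")
        tabs, rest = divmod(len(line) - len(body), width)
        out.append("\t" * tabs + " " * rest + body)
    return "\n".join(out)

# A heredoc body whose leading spaces are part of the literal text.
HEREDOC = "text = <<<END\n    indented art\nEND;\n"

broken_literal = blind_convert(SOURCE)          # the string now holds a tab
safe_literal = leading_only_convert(SOURCE)     # the string is untouched
broken_heredoc = leading_only_convert(HEREDOC)  # the heredoc text is altered
```

Restricting the conversion to leading whitespace protects ordinary one-line strings, but a converter that does not parse the language cannot tell a heredoc body from indented code, so the literal text still gets changed.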
Another practical issue is error recovery. Suppose one uses 4 spaces for an indentation. Now, it is not uncommon to see lines whose number of leading spaces is not a multiple of 4, such as 7 or 10, out of common sloppiness. Such errors happen more often when spaces are used for indentation. The essence is that tabs enforce a semantic association, and with tabs it is impossible to make a half-indentation.
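With spaces, such slips can only be caught after the fact by a check along these lines. This is a rough Python sketch with a made-up function name, and it assumes the intended indent width is known in advance.

```python
def half_indented_lines(source: str, width: int = 4) -> list[int]:
    """Return 1-based numbers of lines whose leading spaces are not a multiple of `width`."""
    bad = []
    for number, line in enumerate(source.splitlines(), start=1):
        stripped = line.lstrip(" ")
        # Ignore blank lines; flag any prefix that is not a whole number of levels.
        if stripped and (len(line) - len(stripped)) % width:
            bad.append(number)
    return bad

# Prefixes of 0, 4, 7, 10 and 8 spaces.
sample = "\n".join(
    ["a", " " * 4 + "b", " " * 7 + "c", " " * 10 + "d", " " * 8 + "e"]
)
flagged = half_indented_lines(sample)
```

The check can report that a 7-space line is wrong, but not whether the author meant 4 or 8. With tabs the question never arises, because there is no character for a fraction of a level.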
Well, I just like spaces because they are most compatible.
Sure, crass simplicity is always more compatible. Suppose a unixer says he doesn't like HTML because it is fraught with problems and incompatibilities; he'd rather have plain text. And, indeed, a lot of unixers seriously think that. (In the early history of the web (pre-1995), plain text wrapped in a <pre> tag was actually the competing format to HTML on the web, championed by unixing morons. In the year of our lord 2006, we can still see historic remains of this thought in the vast number of Project Gutenberg's texts and their “plain-text format” declarations:
**Welcome To The World of Free Plain Vanilla Electronic Texts**
**Etexts Readable By Both Humans and By Computers, Since 1971**
Addendum: it seems that Project Gutenberg has since removed the above lines from their texts.
(In a very similar vein, unixers invariably favor ASCII, with 1 byte = 1 character, over Unicode.))
Unixers will no doubt delude themselves that at least one advantage of simple-minded raw formats is universal compatibility and portability. Indeed that is much the story of unix. Like: a pauper's safeguard against robbery is to remain a pauper; a idiot's means to fame is idiocy; a bimbo's “no” relies on sex-harassment laws; a chimp's way of education is thru Affirmative Action; and unix moron's way of universality is crass simplicity. — Xah Lee in Unix and mbox Email Format