How to Improve Python Doc; Notes on Rewriting Python Regex Doc

By Xah Lee. Date:

In 2005, i started to learn Python by reading its official documentation. In the process, i found the doc's quality really bad. I have written several essays regarding its problems, collected at Python Documentation Problems. Subsequently, i undertook the task of completely rewriting the doc of Python's RE module. See Python: Regex Reference. What follows are some notes on this rewrite experience.

Remove Command Line Interface Look and Feel

In the doc, examples are often given in Python command line interface format. For example:

>>> def f(n):
...     return n+1
...
>>> f(1)
2

instead of:

def f(n):
    return n+1

print(f(1))   # prints 2

The clean format should be used because it does not require familiarity with the Python command line, it is more readable, and the code can be copied and run readily.

A significant portion of the Python doc's readers, if not the majority, did not come to Python as beginning programers, and one way or another have never used or cared about the Python command line interface.

Suppose a non-Python programer is casually shown a page of Python doc. She will get much more from actual code examples than from code cluttered with Python command line interface prompts.

Suppose now we have an experienced professional Python programer. She will also find examples in plain code more immediately readable and familiar than the version plastered with Python command line interface prompts.

The only place where the Python command line look-and-feel is appropriate is in the Python tutorial, and arguably only in the beginning chapter.

Extra point: If the Python command line interface were actually a robust application, like a so-called IDE, for example the Mathematica front-end, then things would be very different. In reality, the Python command line interface is a toy whose best use is as a simple calculator, doubling as a chanting novelty for ignorant coders. [see REPL Jargon] In practice it isn't even suitable as a trial'n'error pad for real-world programing.

Extra point: Do not use the jargon “interpreter”. 90% of its uses in the doc should be deleted, or replaced with “software”, “program”, “command line interface”, “language”, or others.

(Possibly 50% of all uses of the word interpreter in computer language contexts are inane, fathering large amounts of misunderstanding and confusion.)

Move Irrelevant Histories to One Place

Notes on Python's history are littered all over the doc. Example:

Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.

99% of programers are not concerned with the history of a language. Inevitably, languages change over time, however conservative one tries to be. So, move all these changes into a “New and Incompatible Changes” page in some appendix of the lang spec. This way, people who are maintaining older code can find their info in one coherent place, while the general programers are not forced to wade thru the details of the past every few paragraphs. (A few exceptions can be made: when a change is of such major importance that all practicing Python coders must be informed, regardless of whether they maintain old code.)

Organize by Functionality, Not by Implementation or Computer Sciency

Do not take an attitude like you have to stick to some artificial format or “correctness” in the doc. Remember, the doc's prime goal is to communicate to programers how a language functions, not how it is implemented or how it would be described in technical or computer science terms.

In writing language documentation, there is a question of how to organize it. This is an issue of design, and it takes thinking.

When a doc writer is faced with such a challenge, the easiest route is the no-brainer of following the way the language is implemented. For example, the doc will start with the language's “data types”. This no-brainer stupidity is unfortunately how most language docs are organized, and the Python doc is one of the worst.

One can see this phenomenon in the official doc of Python's RE module. For example, it begins with Regex Syntax, then it follows with “Module Contents”, then Regex Objects, then Match Objects. There is no indication of what these are. They just lie there coldly, as if these rubrics are from some natural taxonomy that any Python programer would know by heart. And in each page, the functions or methods are arranged in alphabetical order. This is typical of the no-brainer organization that follows how the module is implemented, or how its layout accumulated historically. It has only a remote connection to how the module is used to perform a task.

In general, language docs should be organized by the tasks the language is supposed to perform, or by functionality. In other words, from the point of view of the language's users, not the language's compiler. A language specification or reference can be organized by implementation or alphabetically.

For example, organize the RE module doc by the purposes of the module. To begin, we explain at the outset that this module is for searching and/or replacing text in a string by a pattern. Then, we organize with purpose and functionalities as guide.

Since the Python RE module provides code for two paradigms, Functional and Object-Oriented, we create a page for each, with a clear indication of how they relate to the string pattern search/replace task. Since Python returns the result as a special Object, we again create a section MatchObject and explicitly tell the readers what that page is about in relation to the task. We also put the regex syntax on its own page, but again make it clear what this page means in relation to the task. And in each page, we again organize the content by the guide of tasks and functionalities. In this way, the whole RE module doc is oriented to programing, not implementation idiosyncrasy or superficial taxonomy.
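To make the two paradigms and the match object concrete, here is a minimal sketch. The sample string and variable names are made up for illustration; only re.search, re.compile, re.sub and the match object methods come from the standard library.

```python
import re

text = "order 66 shipped on day 12"   # made-up sample string
pattern = r"\d+"                       # one or more digits

# Function style: pass the pattern string on every call.
m1 = re.search(pattern, text)

# Object style: compile once, then call methods on the regex object.
regex = re.compile(pattern)
m2 = regex.search(text)

# Either way the result is a match object (or None when nothing matches).
print(m1.group(0), m1.start(), m1.end())   # 66 6 8
print(m2.group(0) == m1.group(0))          # True

# Replacing works in both styles too.
print(re.sub(pattern, "N", text))          # order N shipped on day N
print(regex.sub("N", text))                # order N shipped on day N
```

Both styles do the same search/replace task. The object style mainly saves retyping the pattern when it is used many times.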

The benefit of organizing docs by tasks is that the language evolves with a focus on tasks, and not into more and more arcane convoluted technicalities (unix is like this).

Example of Masturbation and Ignorance

Here is an illustration of the Info Tech industry's masturbation and ignorance.

The official Python doc on regex syntax http://python.org/doc/2.4/lib/re-syntax.html says:

"|"

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the "|" in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by "|" are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the "|" operator is never greedy. To match a literal "|", use \|, or enclose it inside a character class, as in [|].

Note: “In other words, the "|" operator is never greedy.”

Note the need to inject the high-brow jargon “greedy” in a latched-on sentence.

“never greedy”? What is greedy anyway?

“Greedy”, when used in the context of computing, describes a certain characteristic of algorithms. An algorithm for a minimizing/maximizing problem is greedy if, whenever it faces a choice, it simply takes the option that looks best at that moment, without considering whether that choice actually results in an optimal solution.

The rub is that such a strategy will fail to obtain the optimal result in many problems. If you go from New York to San Francisco and always choose the road most directly facing your destination, you'll never get there.

For an algorithm to be greedy, it is implied that it faces choices. In the case of alternatives in regex "regex1|regex2|regex3", there is really no selection involved, just following a given sequence.

What the writer was thinking when he latched on the bit about greediness is that the result may not be from the pattern that matches the longest substring, therefore it is not “greedy”. It's not greedy Python docer's ass.
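A small sketch of a greedy strategy missing the optimum, using coin change as a stand-in problem. The coin set [1, 3, 4], the amount 6, and both function names are assumptions chosen for illustration, not anything from the Python doc.

```python
def greedy_coins(amount, coins):
    """Greedy: always take the largest coin that still fits."""
    picked = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            picked.append(coin)
            amount -= coin
    return picked


def fewest_coins(amount, coins):
    """Exhaustive (dynamic programming): best[a] is a shortest coin list summing to a."""
    best = [[]] + [None] * amount
    for a in range(1, amount + 1):
        for coin in coins:
            prev = a - coin
            if prev >= 0 and best[prev] is not None:
                candidate = best[prev] + [coin]
                if best[a] is None or len(candidate) < len(best[a]):
                    best[a] = candidate
    return best[amount]


coins = [1, 3, 4]                 # hypothetical coin set
print(greedy_coins(6, coins))     # [4, 1, 1]  three coins
print(fewest_coins(6, coins))     # [3, 3]     two coins
```

The greedy version grabs the 4 because it looks best at that step, and ends up with three coins where two would do.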

Such unnecessary jargon throwing, as found everywhere in tech docs, is a significant reason why the computing industry is filled with shams the likes of unix, Perl, Programing Patterns, eXtreme Programing, “Universal” Modeling Language.

Here is a version that is simpler, clearer, to the point:

The vertical bar is used to express alternatives in regex. For example, r"regex1|regex2|regex3" will match any of the regexes, tried from left to right. For example, if regex2 matches, regex3 will not be tried, even if it would also match and would match a longer substring than regex2. Alternatives can be used inside capture groups as well (see Captures below).

To match the vertical bar | exactly, use \|.
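A short sketch of the left-to-right behavior described above. The sample strings are made up; the calls are plain re.search and re.findall.

```python
import re

text = "abcd"   # made-up sample string

# "ab" is listed first and matches, so the longer "abcd" alternative is never tried.
first = re.search(r"ab|abcd", text).group(0)

# Swap the order and the longer alternative wins, because it now comes first.
second = re.search(r"abcd|ab", text).group(0)

# An escaped bar matches a literal "|".
bars = re.findall(r"\|", "a|b|c")

print(first)    # ab
print(second)   # abcd
print(bars)     # ['|', '|']
```

So when alternatives overlap, put the longer or more specific one first.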

HTML Problems in Python Doc

I don't know what kind of system is used to generate the Python docs, but it is quite unpleasant to work with manually, as there are egregious errors and inconsistencies.

For example, on the “Module Contents” page http://python.org/doc/2.4/lib/node111.html, the closing tags for <dd> are never used, and all the tags are in lower case. However, on the regex syntax page http://python.org/doc/2.4/lib/re-syntax.html, the closing tags for <dd> are given, and all tags are in CAPS.

The doc declares itself to be HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <html> <head>

yet the files use /> to close image tags, which is XHTML syntax and illegal in HTML 4.

The doc litters <p> and never closes them, making it illegal as XML/XHTML by breaking the minimal requirement of well-formedness.

Aside from being invalid HTML, the code is quite bloated, as is generally true of generated HTML. For example, it is littered with: <tt id='l2h-853' xml:id='l2h-853'>. These ids aren't used in the style sheet, and i don't think they serve any purpose other than for a style sheet.

Although the doc uses a huge style sheet and almost every tag comes with a class or id attribute, it also profusely uses hard-coded style tags like <b> and <big>, and Netscape's non-standard <nobr>.

It also abuses tables that effectively do nothing. Here's a typical line:
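The well-formedness point can be checked mechanically. Below is a minimal sketch using the XML parser in Python's standard library. The two markup snippets and the helper name is_well_formed are made up for illustration; they are not taken from the actual Python doc pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippets: same content, with and without closing </p> tags.
unclosed = "<div><p>first paragraph<p>second paragraph</div>"
closed = "<div><p>first paragraph</p><p>second paragraph</p></div>"


def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(unclosed))   # False
print(is_well_formed(closed))     # True
```

An XML parser rejects the first snippet outright, which is what failing well-formedness means in practice.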

<table cellpadding="0" cellspacing="0"><tr valign="baseline"> <td><nobr><b><tt id='l2h-851' xml:id='l2h-851' class="function">compile</tt></b>(</nobr></td> <td><var>pattern</var><big>[</big><var>, flags</var><big>]</big><var></var>)</td></tr></table>

If Python is supposed to be a quality language, then its documentation and HTML code seem to indicate otherwise.

Addendum

After 8 years, the Python doc hasn't improved much. Complaints about the Python doc come up about every year on the Python mailing list, and Python doc wikis constantly crop up, but the Python priests always flame them and turn them down.

See also: Why Open Source Documentation is of Low Quality